An automated protocol to construct flexibility parameters for classical forcefields: applications to metal–organic frameworks

In this work, forcefield flexibility parameters were constructed and validated for more than 100 metal–organic frameworks (MOFs). We used atom typing to identify bond types, angle types, and dihedral types associated with bond stretches, angle bends, dihedral torsions, and other flexibility interactions. Our work used Manz's angle-bending and dihedral-torsion model potentials. For a crystal structure containing Natoms in its unit cell, the number of independent flexibility interactions is 3(Natoms − 1). Because the number of bonds, angles, and dihedrals is normally much larger than 3(Natoms − 1), these internal coordinates are redundant. To reduce (but not eliminate) this redundancy, our protocol prunes dihedral types in a way that preserves symmetry equivalency. Next, each dihedral type is classified as non-rotatable, hindered, rotatable, or linear. We introduce a smart selection method that identifies which particular torsion modes are important for each rotatable dihedral type. Then, we computed the force constants for all flexibility interactions together via LASSO regression (i.e., regularized linear least-squares fitting) of the training dataset. LASSO automatically identifies and removes unimportant forcefield interactions. For each MOF, the reference dataset was quantum-mechanically-computed in VASP via DFT with dispersion and included: (i) finite-displacement calculations along every independent atom translation mode, (ii) geometries randomly sampled via ab initio molecular dynamics (AIMD), (iii) the optimized ground-state geometry using experimental lattice parameters, and (iv) rigid torsion scans for each rotatable dihedral type. After training, the flexibility model was validated across geometries that were not part of the training dataset. For each MOF, we computed the goodness of fit (R-squared value) and the root-mean-squared error (RMSE) separately for the training and validation datasets. We compared flexibility models with and without bond–bond cross terms. Even without cross terms, the model yielded R-squared values of 0.910 (avg across all MOFs) ± 0.018 (st. dev.) for atom-in-material forces in the validation datasets. Our SAVESTEPS protocol should find widespread applications to parameterize flexible forcefields for material datasets. We performed molecular dynamics simulations using these flexibility parameters to compute heat capacities and thermal expansion coefficients for two MOFs.


Introduction
Optimizing forceelds for classical molecular dynamics and Monte Carlo simulations of materials is a pragmatic task focusing on practical aspects of usability and accuracy.In this work, we focus on applications to porous solids such as metalorganic frameworks (MOFs).Several different approaches have been developed to optimize the exibility parameters (e.g., bond Chemical & Materials Engineering, New Mexico State University, Las Cruces, NM 88001, USA.E-mail: tmanz@nmsu.edu† Electronic supplementary information (ESI) available: A PDF le containing: (a) supplementary data plots and tables, (b) additional computational details, and (c) images visualizing the atomic positions in the crystal structure of each MOF.A spreadsheet containing tables listing detailed information for each MOF: (a) CoRE MOF refcode, QMOF ID, number of atoms, and reduced chemical formula, (b) number of bond stretches, angle bends, dihedral torsions, Urey-Bradley interactions, and bond-bond cross terms, (c) numbers of ADDT non-rotatable/hindered, CADT non-rotatable/hindered, ADDT rotatable, CADT rotatable, and ADDT linear dihedral instances for each MOF, (d) list of smart selected modes for each rotatable dihedral type, (e) number non-zero force constants of each type (stretches, bends, torsions) for each MOF, (f) R-squared and RMSE values for training and validation datasets for each MOF using either the average or individual equilibrium values for each exibility type, including results with and without bond-bond cross terms, (g) exibility parameters optimization computational time for each MOF, (h) for quadrant 1, results are listed both with and without dihedral pruning.A 7-zip archive containing: (i) selected input les for each MOF and (ii) output les listing all of the exibility parameter values (i.e., optimized force constant values and equilibrium geometric parameter values) and regression statistics (for training datasets, validation dataset, and atom-wise statistics) for each MOF.See DOI: https://doi.org/10.1039/d4ra01859astretches, angle bends, dihedral torsions, etc.) used to construct such forceelds via tting to quantum-mechanically-computed reference data.2][3][4][5][6] Dubbeldam et al. recently reviewed parameterization schemes for constructing exible forceelds for MOFs. 7,8In pioneering works, several authors introduced rst-principles-derived exible forceelds for specic MOFs. 9,10n 'adoption-plus-tweaking' approaches, the exibility parameter values for a MOF's organic linkers are adopted from a prior forceeld (such as an organic or biomolecular or generic/ universal forceeld), then combined with a few new parameters (e.g., to describe the metal-ligand coupling or other interactions), and then tweaked to reproduce a handful of desired experimental or computed properties.2][13][14][15][16][17][18][19] However, they are only partial re-optimizations and not full optimizations of the exibility parameters' values.This article's focus is on approaches that fully optimize the exibility parameters' values rather than 'adoption-plus-tweaking' approaches that partially re-optimize them.
Partial Hessian-tting strategies (such as the Seminario method 20,21 ) that attempt to optimize the exibility force constants sequentially one-at-a-time rather than simultaneously are generally ill-advised.When the active internal coordinates used are redundant, the corresponding exibility terms are coupled together and do not vary independently of each other.For this reason, the corresponding force constants must be optimized simultaneously rather than sequentially one-at-atime.A previously published attempt to optimize the exibility parameters for a MOF using the Seminario method failed. 22Specically, the Seminario method oen gives anglebending force constants that are too stiff, sometimes being as much as a factor of two too large. 20,22trategies that only t the full Hessian 23 are not generally robust, because they only sample geometries on the potential energy landscape that are differentially close to the optimized ground-state geometry.This problem can only be xed by also including in the training dataset some (non-Hessian) geometries that are far away from the optimized ground-state geometry.
Several authors used genetic or evolutionary optimization algorithms to optimize forceeld exibility parameters for specic MOFs. 1,246][27] In the MOF-FF approach, terms including preset non-bonded parameter values (e.g., atomic charges and van der Waals (VDW) parameters) are included in the Hessian and energy expressions when the exibility parameter values are optimized. 25,26The MOF-FF approach uses the quantummechanically-computed optimized geometry and Hessian as target reference data to t the forceeld's exibility parameters. 25,26Since the Hessian corresponds to small displacements about the equilibrium geometry, it appears that the large displacements associated with rotational barriers are insufficiently sampled in the MOF-FF parameterization protocol.For this reason, the dihedral torsion terms may not be accurately (or sufficiently) sampled in the MOF-FF parameterization protocol.
Gabrieli et al. used force matching to optimize exibility parameters for the ZIF-8 MOF, the silicalite zeolite, and the molecules methane and carbon dioxide. 28Their protocol involved the following steps.First, they performed ab initio molecular dynamics (AIMD) calculations using density functional theory (DFT).Then, they used a constrained search optimization algorithm (specically, L-BFGS-B) to calculate the exibility parameter values that minimized the sum of squared differences between the DFT-computed and exibility-modelcomputed atom-in-material forces across their training set of AIMD geometries.
The QuickFF approach rst determines dihedral multiplicities and dihedral resting values, then it performs a series of quantum-mechanical calculations for perturbation trajectories along the corresponding internal coordinate (i.e., bond length or bond angle) for each bond stretch and angle-bending term in the forceeld to compute the corresponding 'resting value' of the bond length or bond angle, and nally it uses least-squares tting between the ab initio Hessian and the forceeld's Hessian to optimize all force constant values. 29While the original QuickFF protocol used non-periodic cluster models to represent periodic crystals, an updated QuickFF protocol was subsequently published that can use fully periodic models. 30he updated QuickFF protocol ts the mass-weighted Hessian instead of the non-mass-weighted Hessian, and it can include cross terms and/or anharmonic terms in the forceeld. 30The updated QuickFF protocol is a sequence of six major steps that involve optimizations and re-optimizations (aka tune-ups). 30A key feature of the QuickFF protocol is that terms including preset non-bonded parameter values (e.g., atomic charges and VDW parameters) are included in the Hessian and energy expressions when the exibility parameter values are optimized. 29,302][33][34][35][36][37][38][39][40][41][42][43][44][45][46][47][48] According to the published descriptions, the QuickFF protocol does not currently treat dihedral torsions rigorously but instead uses a lone cosine mode potential for each dihedral, where each ABCD dihedral is assigned a multiplicity m ABCD . 29,30Obviously, many dihedrals cannot be described by such a restricted potential form.Those dihedrals that could not be described by such a simplied potential were neglected, and this may cause the parameterized forceeld to be inaccurate. 29,30ubbeldam and coworkers developed exible forceelds that were optimized to reproduce the elastic response properties or volume-versus-temperature curve of MOFs. 7,11,12,49These can be referred as 'top-down' approaches that focus on bulk response properties as opposed to 'bottom-up' approaches that focus on forces and motions of individual atoms and chemical groups within the material.
In the present article, we develop a different exibility parameterization strategy that is based on Force Field Functional Theory (FFFT).As described in a companion article, FFFT studies "topics related to the functional representation of nonreactive forceelds to achieve various desirable properties". 50Specic theoretical advances of FFFT that are directly relevant to the present article include: (1) A new ansatz for separating the bonded potential energy from the nonbonded potential energy within a bonded cluster that does not introduce any new approximations and enables bonded parameters to be optimized using linear regression instead of requiring nonlinear regression. 50(Examples of a bonded cluster include a molecule or a MOF.) Manz's ansatz separates the bonded potential energy from the nonbonded potential energy in such a way that the 'resting values' of internal coordinates appearing in the forceeld's exibility terms are identically equal to the equilibrium values of those internal coordinates in the isolated bonded cluster's optimized ground-state geometry. 50The forceeld's total potential energy is represented as 50 where RA is the position of atom A and ℤ A is its atomic number (aka element number).
(2) Most importantly, Manz' ansatz denes the intracluster bonded interactions in such a way that the atom-in-material forces for extremely small (i.e., innitesimal) displacements relative to the isolated bonded cluster's optimized ground-state geometry do not depend on any intracluster nonbonded interactions. 50This allows the intracluster bonded interactions to be rigorously parameterized up to second order (i.e., within a harmonic approximation) without having to include the intracluster nonbonded interactions. 50 (Manz's ansatz can be used to optimize the exibility parameter terms so that the forceeld rigorously describes the anharmonicities (i.e., thirdorder and higher-order derivatives of the energy), but this requires including intracluster nonbonded interactions when the bonded parameter values are optimized 50 ).The present article focuses exclusively on parameterizing the intracluster bonded interactions (i.e., parameterizing the exibility terms) up to second-order derivatives in the energy.The intracluster nonbonded interactions and intercluster nonbonded interactions have been partly studied in several previous publications (co)authored by one of us [51][52][53][54][55][56][57][58][59][60][61][62][63][64] and will be further studied in some of our upcoming publications.
(3) New angle-bending and dihedral torsion model potentials that are nearly universal, improve accuracy, improve numerical stability, and have a small number of adjustable parameters. 50Most importantly, these model potentials avoid derivative discontinuities (i.e., force discontinuities) associated with linear bond angles. 50) "Forceeld design that guarantees the reference groundstate geometry is exactly reproduced as an equilibrium structure on the forceeld's potential energy landscape". 50In this work, the reference ground state geometry consisted of the experimental lattice parameters dening the unit cell's size and shape plus DFT_with_dispersion optimized atom-in-material positions.
(5) "Well-designed methods to parameterize the forceeld from quantum-mechanically-computed and (optionally) experimental reference data". 50The SAVESTEPS protocol introduced in the present article accomplishes this.
(6) "Computationally efficient embedded feature selection that identies and removes unimportant forceeld terms". 50ithin the present article, we developed three important embedded feature selection techniques: (a) dihedral pruning as described in Sections 5.4.3 and 8.6.1,(b) smart selection of rotatable dihedral modes as described in Sections 7.1 and 8.5, and (c) least absolute shrinkage and selection operator (LASSO 65,66 ) regression as described in Sections 7.4 and 8.6.
A key goal of this article is to create an automated workow that allows a large number of materials to be processed efficiently.To the best of our knowledge, this is the rst time rstprinciples-derived exibility parameters have been optimized in a system-specic manner for more than one hundred MOFs in a single study.To date, 'generic/universal' forceelds (e.g., UFF 67 and UFF4MOF 68,69 ) that attempt a common parameterization across multiple material types have not been accurate for describing dihedral torsions in MOFs, even though they do a reasonably good job of predicting equilibrium bond lengths, bond angles, and bulk moduli in many materials [68][69][70] (however, some modications to UFF4MOF are needed to treat rare earth elements 71 ).We attribute this limitation of 'generic/universal' forceelds to the algebraic dependencies that mathematically couple dihedrals to each other and to other exibility parameters due to the redundancy of exibility parameters (especially dihedral angles).In contrast to non-bonded parameters that exhibit a high degree of transferability across similar chemical environments for a given second-neighbor-based atom type, 57 the redundancy of exibility parameters (especially dihedrals) impairs transferability of the exibility parameter values (especially torsion potentials) between two different chemical building blocks.Because this redundancy is difficult to remove or avoid, and because torsion potentials are exquisitely sensitive to the chemical environment, we believe it is generally preferable to optimize exibility parameter values specically for each chemical building block rather than trying to transfer their values across different chemical building blocks.Here, the term 'chemical building block' could mean either a specic bonded cluster (such as a molecule or a MOF) or a specic monomer in a polymer (e.g., a specic amino acid in a protein sequence, a specic base pair in DNA, or a specic RNA base, etc.).Thus, our strategy is to create an automated workow that optimizes exibility parameters specically for each material.
Our protocol develops new best practices for the typing of bonds, angles, and dihedrals.We use Chen and Manz's secondneighbor-based atom typing scheme to dene the atom types. 57o minimize (but not eliminate) internal coordinate redundancy, angles in 3-and 4-membered rings are agged in the internal coordinate list and not used in the angle-bending potential, while diagonals in 4-membered rings are added to the list of Urey-Bradley 72 stretches.A key strength of our parameterization protocol is the more accurate and more automated treatment of dihedral torsion modes than prior literature approaches.Key improvements of our approach include: (a) Automated pruning of dihedral types to reduce (but not completely eliminate) internal coordinate redundancy; our protocol does this in a way that preserves symmetry equivalency.
(c) Our protocol specically performs a series of quantummechanical calculations for scans along each rotatable dihedral type.Our protocol automatically analyzes this data to determine which specic subset among the rst seven possible torsion modes contribute to each rotatable dihedral torsion energy curve.This ensures each rotatable dihedral term has optimal form.
(d) Our protocol samples the rotatable dihedral barriers thoroughly by including an energy scan for each rotatable dihedral type when optimizing all of the force constants.
(e) As described above, our protocol uses Manz's 50 new anglebending and dihedral-torsion model potentials that avoid derivative discontinuities (i.e., force discontinuities).
Previous forceelds included some but not all of these aspects for modeling dihedral torsions.The AMBER forceeld uses a truncated Fourier series expansion of the torsion potential for which particular modes were manually selected for different dihedral types based on dihedral scans (using quantum-chemistry calculations) to generate potential energy curves. 73,74Barone et al.'s forceeld parameterization protocol for molecules included (i) the classication of each dihedral as so or stiff, (ii) dihedral scans (using quantum-chemistry calculations) to generate potential energy curves for so dihedrals, and (iii) a truncated Fourier series expansion of the torsion potential for so dihedrals. 5Grimme's QMDFF parameterization protocol included (i) the classication of each dihedral as rotatable or non-rotatable, (ii) dihedral scans (using tight-binding calculations) to generate potential energy curves for rotatable dihedrals, and (iii) a four-term distancedamped modied Fourier series expansion of the torsion potential for rotatable dihedrals. 6ur protocol includes physically-motivated non-negative bounds for some of the force constants.Specically, we constrained force constants for the bond stretches, Urey-Bradley stretches, angle bends, non-rotatable/hindered dihedral torsions, and linear-dihedral torsions to be non-negative.We did not apply bounds to the bond-bond cross terms.If a rotatable dihedral torsion type had more than one active mode, no bounds were applied to the force constants associated with this torsion.If a rotatable dihedral torsion type had only one active mode, the force constant associated with this lone torsion mode was constrained to be non-negative.These choices are physically motivated as described in Section 7.4.
The remainder of this article is organized as follows.Section 2 describes the specic model potentials we used for bond stretches, angle bends, dihedral torsions, and other exibility terms (e.g., Urey-Bradley interactions, bond-bond cross terms).Section 3 gives an overview of the major features of our SAVESTEPS approach.Section 4 describes the crystal geometry verication steps we performed to ensure the crystal structures chosen were reliable.Section 5 describes the identication of bond types, angle types, and dihedral types.Section 5 also describes the pruning of redundant dihedral types and the classication of each dihedral type as rotatable, hindered, non-rotatable, or linear.Section 6 describes the quantum chemistry methods.Section 7 describes the rotatable dihedral mode selection and the regularized linear leastsquares tting that we performed to optimize all force constants.Section 7 also contains formulas for computing R-squared and root-mean-squared error (RMSE) that quantify how well the model performed.Section 8 presents and analyzes the computed exibility parameterization results.Section 9 investigates whether the force constant values are transferable for matched types occurring in two different chemical structures.In Section 10, the heat capacity and coefficient of thermal expansion computed using molecular dynamics simulations for IRMOF-1 are compared to experimental measurements and to values computed using other forceelds.Section 10 also presents these computed bulk properties for MIL-53(Ga) using our exibility model.Section 11 concludes.Note: in this article, function arguments are enclosed in square brackets; for example, h[q] would denote a function h that depends on q, while h(q) would denote h multiplied by q.

Types of exibility terms to include
As reviewed in the literature, the bonded interaction potential in non-reactive exible forceelds is typically constructed by combining bond stretch, angle bend, dihedral torsion, (optionally) Urey-Bradley, (optionally) cross terms, and (optionally) concurrence terms: Fig. 1 illustrates types of bonded interactions studied in this work.Without loss of generality, we can write the bonded interaction potential energy for an individual bonded cluster as a linear combination of exibility terms 50 where k j is the force constant, {a h } are the corresponding active internal coordinates, and {a eq h } are the equilibrium values of these internal coordinates in the isolated bonded cluster's optimized ground-state geometry.The product k j g j [{a h ,a eq h }] is called a 'exibility term'.Since the internal coordinate values are a function of the material's geometry, we can also use the functional form where By default, our protocol uses a harmonic bond stretch potential between all rst bonded neighbors (e.g., two atoms A and B directly bonded to each other): where k is the force constant, d is the current bond length, and d eq is the reference value of the equilibrium bond length in the optimized ground-state geometry.This harmonic bond stretch is simple, popular, and easy to parameterize.If desired, our protocol could be used with other bond stretch potentials including but not limited to Morse, 80 quartic, 79 MM3, 8,81 rigid (i.e., an inexible bond with xed length), etc.The Morse, quartic, and MM3 bond stretch potentials include some anharmonic terms.The Morse potential approaches a constant value as the two atoms get far apart.Each 3-membered ring is a triangle whose shape is completely determined by the 3 bond lengths forming the triangle's edges.Since these three bond lengths are included in their corresponding bond stretch potentials, by default our protocol omits angle bends for angles internal to 3-membered rings.
Urey-Bradley (UB) interactions are distance-dependent interactions between second-bonded neighbors. 72Each 4-membered ring has 4(3)/2 = 6 internal relative distances, so only 6 internal coordinates are required to describe its shape.These 6 internal coordinates can be constructed by using 4 bond stretches for the ring's 4 edges plus two UB terms for the ring's two diagonals.This is a more compact representation of the internal degrees of freedom than using 4 angle bends plus 4 bond stretches.Accordingly, by default our protocol includes UB terms for the diagonals of 4-membered rings and omits angle bends for angles internal to 4-membered rings.By default, our protocol uses the harmonic stretch potential (eqn (7)) for these UB terms.Although not included by default, our protocol could also include UB terms between additional pairs of second-bonded neighbors.
In a companion article, one of us introduced a new anglebending potential that has four distinct advantages: 50 (1) It has a quadratic-like form for small displacements from the equilibrium bond angle over the entire range of possible equilibrium bond angles: 0 < q eq # p.
(2) It has continuous derivatives of all orders for all angle values even at q = p.
(3) As the bond angle approaches zero (i.e., q = 0), the anglebending potential energy tends towards innity.This mimics the Pauli repulsion of electrons that energetically prohibits bond angle values from reaching zero.
(4) It has a simple analytic form with only a single adjustable parameter, which is the force constant k angle .
To the best of our knowledge, no previous angle-bending potential simultaneously has all four of the above features.This new angle-bending potential has the form 50 Although it is possible to use other angle-bending model potentials with our SAVESTEPS protocol, the above anglebending potential is preferable and was used in this work.
One of the key strengths of our SAVESTEPS protocol is a comprehensive yet computationally efficient treatment of dihedral torsions.In a companion article, one of us derived new dihedral-torsion model potentials 50 that we used in this work.These dihedral-torsion model potentials are described in the next section.Although it is possible to use other dihedraltorsion model potentials with our SAVESTEPS protocol, Manz's new dihedral-torsion model potentials have many compelling advantages. 50ur protocol can optionally include various types of cross terms.Some types of cross terms described in the prior literature include bond-bond, bond-bend, bend-bend, bondtorsion, bend-torsion, and others. 8,79,81,82In this work, we compared the performance of exibility models optimized with and without bond-bond cross terms.We used the following model potential for bond-bond cross terms: Cross terms and/or UB terms are sometimes required to match the experimental vibrational spectrum.For example, a carbon dioxide molecule has three elementary vibrational modes: (i) a symmetric stretch at 1333 cm −1 , (ii) an antisymmetric stretch at 2349 cm −1 , and (iii) a wag (i.e., angle-bending) mode at 667 cm −1 wavenumber. 83Here, the symmetric and In this case, model_1's Hessian expressed in terms of internal coordinates (d AB , d BC , q ABC ) is where the off-diagonal terms are zero because both of the following conditions are satised: (a) no cross-terms were included in this forceeld and (b) the internal coordinates are independent of each other (i.e., non-redundant).Since this Hessian is diagonal, it immediately follows that these three internal coordinates are the normal vibrational modes.Due to the symmetry of the two C-O bonds in a CO 2 molecule, we have in this case Consequently, two vibrational frequencies are predicted by this forceeld model to be energy degenerate.These two degenerate bond vibrational modes can be linearly combined to yield degenerate symmetric and antisymmetric stretch modes.Because such a forceeld yields symmetric and antisymmetric stretch modes that have the same frequency, it cannot approximate the carbon dioxide molecule's experimental vibrational spectrum.Consequently, a cross term and/or an UB term must be added to this forceeld to resolve this problem.This derived result is general and holds irrespective of the particular functional forms of U bond [d AB ] and U angle [q ABC ].
However, sometimes cross terms and/or UB terms are not required.For example, an isolated water molecule has three elementary vibrational modes: (i) a symmetric stretch at 3657 cm −1 , (ii) an antisymmetric stretch at 3756 cm −1 , and (iii) a wag (i.e., angle-bending) mode at 1595 cm −1 wavenumber. 83he theoretical analysis parallels that for the CO 2 molecule described above, except that for a H 2 O molecule the symmetric and antisymmetric stretches have frequencies that differ from each other by only ∼3%.Consequently, a forceeld model of the form shown in eqn ( 12)-( 14) above can provide a reasonably good t to the water molecule's experimental vibrational spectrum.
Many exible forceelds described in the prior literature include concurrence terms. 6,8,30,82,84,85Mathematically, a point of concurrence is where three or more line segments meet at a point.In a material, this corresponds to the situation in which three or more bonds share a common atom.Like cross terms, concurrence terms rene the potential energy expression beyond the basic description provided by bonds, angles, and dihedrals.Consider the ammonia (NH 3 ) molecule as an example.At its equilibrium ground-state geometry, the three H-N-H angles in NH 3 sum to a value smaller than 2p; however, these three angles sum to exactly 2p in the planar transition state for the inversion reaction.Because these angles have a different value in the transition state than in the ground state structure of ammonia, using an angle-bending potential by itself already gives a positive inversion barrier without including a special concurrence term in the forceeld.However, it may be desirable to include a special concurrence term in the forceeld to ne-tune the inversion barrier's value.As another example, consider a planar molecule such as benzene.Suppose that atom C( 1 3) and H, those three angles sum to less than p. Accordingly, using an anglebending potential by itself already gives an out-of-plane energy increase for benzene without including a special concurrence term in the forceeld.However, it may be desirable to include a special concurrence term in the forceeld to ne-tune the potential energy.In the prior literature, concurrence terms are typically constructed using one of the following chemical descriptors: outof-plane distance, out-of-plane angle, and/or improper-dihedral 6,8,30,84,85 (in this work, the standalone term 'dihedral' always refers to a proper dihedral, while 'improper-dihedral' will always be explicitly used for improper-dihedrals).
Since adding more terms (e.g., cross terms, concurrence terms, anharmonic terms, etc.) increases the forceeld's complexity, a key question is how to identify which particular terms substantially improve the forceeld's accuracy and which are insignicant.Our protocol includes two major innovations to address this question.As described in Section 8.7, our protocol computes statistics (e.g., R-squared and RMSE) for individual atoms in a material to identify how well the exibility model performs for different atoms in the material.This highlights particular atoms (if any) for which the exibility model needs to be improved.Our protocol also incorporates several embedded feature selection techniques.During least-squares optimization of the force constants, our protocol uses the LASSO method to identify which forceeld terms are necessary and which are unnecessary for constructing the exibility model.Our protocol automatically generates a concise exibility model that identies and includes only those terms that are valuable.In this work, we used this approach to identify and select which particular bond-bond cross interactions are valuable.Our protocol could also use this approach for other types of cross terms, concurrence terms, anharmonic terms, etc.

Dihedral torsion potentials
The dihedral torsion potential has ve major cases.Case 1: the dihedral type is classied as rotatable, and one or both of the included equilibrium bond angles is $130°.In this case, the following angle-damped-dihedral-torsion (ADDT) potential is used which has up to seven modes: 50 G ADDT mode_5 ½q ABC ; q BCD ; f ABCD ¼ G ADDT mode_6 ½q ABC ; q BCD ; f ABCD ¼ G ADDT mode_7 ½q ABC ; q BCD ; f ABCD ¼ Aer dihedral mode smart selection (see Section 7.1), this yields U ADDT ABCD ½q ABC ; q BCD ; f ABCD À U ADDT ABCD ½q eq ABC ; q eq BCD ; f eq ABCD ¼ X N ABCD active_modes j¼1 U ADDT mode_mj ½q ABC ; q BCD ; f ABCD (23)   where N ABCD active_modes is the number of active modes for dihedral ABCD.The angle-damping factors are dened as follows: where H eq ABC ¼ cos½q eq ABC =2 Case 2: the dihedral type is classied as rotatable, and both of the included equilibrium bond angles are <130°.In this case, the following constant-amplitude-dihedral-torsion (CADT) potential is used which has up to seven modes: 50 Aer dihedral mode smart selection (see Section 7.1), this yields where N ABCD active_modes is the number of active modes for dihedral ABCD.
Case 3: the dihedral type is classied as non-rotatable or hindered, and one or both of the included equilibrium bond angles is $130°.In this case, the dihedral has a restricted range of motion.For small dihedral displacements a harmonic-like potential is sufficient, and this can be approximated by a single torsion mode.Since one of the included equilibrium bond angles is $130°, we still need to include the angle-damping factors.Consequently, for Case 3 we used only mode 1 from the ADDT potential: Case 4: the dihedral type is classied as non-rotatable or hindered, and both of the included equilibrium bond angles are <130°.In this case, the dihedral has a restricted range of motion.For small dihedral displacements a harmonic-like potential is sufficient, and this can be approximated by a single torsion mode.Since both of the included equilibrium bond angles are <130°, the constant torsion amplitude approximation can be used.Consequently, for Case 4 we used only mode 1 from the CADT potential: For Cases 3 and 4, the use of a single torsion mode is an approximation that holds only if the dihedral's displacements are small.If a non-rotatable or hindered dihedral exhibits large displacements (during thermal vibrations) away from the dihedral's equilibrium value, then it could become necessary to add more torsion modes to this dihedral's potential model (we did not perform such an addition for any MOFs studied in this work).Note that in eqn (15), eqn (34), eqn (40) and eqn (41) the raised 'm' or '1' is a superscript index not an exponent.Case 5: in this case, one or both of the equilibrium bond angles in the dihedral is linear (or close to linear).'Close to linear' means that p − q eq ABC < 3 or p − q eq BCD < 3, where 3 is a tolerance (e.g., 0.03 radians).We reiterate that this case applies when one or both of the equilibrium bond angle values is linear irrespective of the instantaneous bond angle value.For these 'linear dihedrals', Manz's ADDT linear model potential 50 (which contained two torsion modes) was used as described in ESI Section S1. † Aer dihedral pruning, only 5 of the 116 MOFs in our study had linear dihedrals.For comparison purposes, we also completely reoptimized the exibility parameterization for these 5 MOFs using an analogous exibility model except the ADDT linear model potential was omitted.We found that the validation dataset R-squared and RMSE (eV Å −1 ) values for these ve MOFs changed little (e.g., in third or fourth signicant digits) when the linear dihedrals were omitted from the exibility model.However, for completeness in the remainder of this article all of the results for these ve MOFs included our ADDT linear model potential.The ADDT linear model potential was not used for the other 111 MOFs that had no aer-pruning linear dihedrals.

Overview of the SAVESTEPS approach
As shown in Fig. 2, SAVESTEPS is an acronym constructed from some of the major features of our approach.Our approach excels particularly at: chemical structure verication; extensive automation; state-of-the-art typing of atoms, bonds, angles, and dihedrals; dihedral pruning that preserves symmetry equivalency; classication of each dihedral type as rotatable, nonrotatable, hindered, or linear; smart selection of torsion modes for each rotatable dihedral type; state-of-the-art angle-bending and dihedral-torsion model potentials; model potentials having improved numeric stability even for linear bond angles; the ability to use linear regression instead of requiring nonlinear regression to optimize values of the exibility parameters; embedded feature selection using the LASSO method to identify and zero out unimportant forceeld terms; thorough training and validation sets; non-negative bounds on force constants for bond stretches, Urey-Bradley stretches, angle bends, single-mode torsions, and ADDT linear torsion modes; insightful statistics both for the whole material and for individual atoms in the material; and excellent computational efficiency.No previously published approach to optimize exibility parameters for classical forceelds has the complete set of these features.
Fig. 3 is a owchart summarizing our automated protocol to construct exibility parameters for classical forceelds.The key steps in this protocol are: Step # 1: the starting chemical structure and geometry are checked for misbonded atoms and other chemical errors.Structures with chemical errors are not accepted.
Step # 2: a quantum chemistry calculation is performed to nd the material's optimized ground-state geometry.For periodic materials, the lattice constants dening the unit cell's size and shape are preferably held xed at the experimental values (if known) while the atom-in-material positions are quantummechanically relaxed (if experimentally-measured lattice vectors are not available or not reliable, then quantummechanically-computed lattice vectors can be used to determine the unit cell's size and shape).If the material's experimental geometry is available, the quantum-mechanicallycomputed geometry is compared to the experimental geometry to ensure they match within a reasonable tolerance.The optimized structure is rechecked for misbonded atoms and other chemical errors.Structures with chemical errors are not accepted.
Step # 3: for the quantum-mechanically-computed optimized ground-state geometry, typing is performed to generate lists of atom types and internal coordinate types (e.g., stretch types, angle types, and dihedral types).To be classied as the same type, two specic occurrences of an internal coordinate must satisfy all three conditions: (i) They must have the same combination and order of atom types.(ii) The internal coordinate's absolute value for the second occurrence must match that of the rst occurrence within a chosen tolerance.(iii) Two angles of the same type must contain the same combination and order of bond types.Two dihedrals of the same type must contain the same combination and order of angle types.
Step # 4: some adjustments are made to the active internal coordinate types list: (1) Urey-Bradley stretches are added for the two diagonals of each 4-membered ring.(2) Angles in 3-and 4-membered rings are agged in the internal coordinates list and not used in the angle-bending potential.(3) Dihedrals containing angles from 3-and 4-membered rings are removed from the active internal coordinates list.(4) Dihedral ABCD is classied as 'linear' if either p − q eq ABC < 3 or p − q eq BCD < 3, where 3 is a tolerance (e.g., 0.03 radians).
Step # 5: a dihedral instance is classied as non-rotatable if its middle bond is part of a bonded ring; otherwise, it is clas-sied as rotatable.A dihedral type is classied as non-rotatable iff any one or more of its instances is non-rotatable.
Step # 6: if two or more different dihedral types pass through the same middle bond type, the lists of middle bond instances for these two dihedral types are compared to see if they are identical.Repetitions and ordering do not matter in this comparison.For example, the list of middle bond instances {a,b,c,a,b} is considered equivalent to {a,c,b} in this comparison.If two or more different dihedral types have equivalent middle bond instances (where repetitions and ordering do not matter in this comparison), only one of these dihedral types is retained in the active internal coordinates list.Since symmetryequivalent dihedral instances are grouped into the same dihedral type, this dihedral pruning preserves the symmetry equivalency while reducing redundancy.
Step # 7: a set of quantum chemistry calculations is performed corresponding to nite displacements along each independent atom translation mode (aka nite-displacement 'Hessian' geometries) plus a set of AIMD-generated geometries that together with the optimized geometry comprise the force training set (as shown by the computational tests in ESI Section S5, † including AIMD-generated geometries in the force training dataset improves the exibility model's accuracy).A completely independent set of AIMD geometries is generated to make the validation set.
Step # 8: for a rotatable dihedral type, one dihedral instance is randomly selected and uniformly displaced (e.g., in 10°i ncrements) over its full range to generate a set of dihedral-Fig.3 Flowchart summarizing our automated protocol to construct flexibility parameters for classical forcefields.Steps performing quantum chemistry calculations are shaded tan.Steps involving the typing of atoms, bonds, angles, or dihedrals are shaded blue.Steps involving linear regression and statistical performance are shaded green.Icons are used to represent particular data that is generated (no separating line) or used (separating line) in a particular step: optimized geometry (X), training AIMD geometries and forces (eight-pointed star), finite-displacement (aka 'Hessian') geometries and forces (triangle), classification of each dihedral as rotatable, hindered, nonrotatable, or linear (hourglass), displaced rotatable dihedral single-point energies (checkmark), active internal coordinate list (heart), and validation AIMD geometries and forces (raindrop).
displaced geometries.Each such dihedral-displaced geometry is then analyzed to identify its atom types.If the atom type for each atom in each dihedral-displaced geometry matches that for the corresponding atom in the optimized ground-state geometry, the dihedral type retains its rotatable classication; otherwise, it is reclassied as a 'hindered dihedral type'.A hindered dihedral corresponds to the situation in which its rotation over some values changes the material's bond connectivity and hence changes the atom type of one or more atoms.This process of generating and analyzing dihedraldisplaced geometries is performed sequentially one rotatable dihedral type at a time (always starting from the optimized ground-state geometry) until all rotatable dihedral types in the active internal coordinate list have been analyzed and classied as 'rotatable' or 'hindered'.
Step # 9: for all dihedral types that retained 'rotatable' clas-sication, single-point quantum chemistry calculations are performed for their dihedral-displaced geometries that were generated in Step # 8 above.This yields a quantummechanically-computed total energy for each such dihedraldisplaced geometry.
Step # 10: for each rotatable dihedral type, the energy curve for its dihedral-displaced geometries is projected onto a set of seven orthonormal torsion modes (as described in Sections 2, 7.1, and 8.5) to identify and smart select the particular torsion modes that contribute signicantly to this energy curve.
Step # 11: although there is some leeway in how to construct the potential model, the following describes a preferred choice.Manz's 50 angle-bending potential is preferably used for each of the active bond angles.For non-rotatable, hindered, and rotatable dihedrals, a CADT model is preferably used iff both contained bond angles are less than 130°; otherwise, an ADDT model is preferably used.Each non-rotatable or hindered dihedral type is normally described by a torsion potential containing a single mode (e.g., mode 1); however, if desired another torsion mode could be added to describe anharmonicity.Each rotatable dihedral type is described by a torsion potential containing the smart selected modes.Dihedrals for which at least one of the contained equilibrium bond angles is linear are preferably modeled using the ADDT linear torsion modes.Either a simple harmonic potential or a more sophisticated potential could be used for the bond and Urey-Bradley stretches.Where desired, cross terms, concurrence terms, and/ or other terms can be optionally included.
Step # 12: the force-constant values are optimized via regularized linear least-squares tting.Non-negative bounds are placed on the force constants for bond stretches, Urey-Bradley stretches, angle bends, lone-mode torsions, and ADDT linear torsion modes.No bounds are placed on force constants for bond-bond cross terms and multi-mode rotatable torsions.LASSO regression is used to automatically identify and zero out the force constants for unnecessary exibility terms.The training dataset includes: (a) A full dihedral scan energy curve for each rotatable dihedral type (if any are present in the material).
(b) Quantum-mechanically-computed atom-in-material forces for the material's optimized ground-state geometry.These forces are zero within a convergence tolerance.
(c) Quantum-mechanically-computed atom-in-material forces for nite-displacement 'Hessian' geometries that sample x, y, z displacements for each atom in the material.
(d) Quantum-mechanically-computed atom-in-material forces for AIMD-generated geometries.Geometries are included for at least ten independent AIMD runs.
Step # 13: using the optimized force-constant values from Step # 12 above, the R-squared and RMSE values are computed for the validation set of geometries.This tests how well the exible forceeld model reproduces atom-in-material forces across a new set of AIMD-generated geometries that were not used in training the model, as well as in the optimized groundstate geometry.R-Squared and RMSE values are also computed and printed for each individual atom in the material to identify particular atoms (if any) for which the forceeld needs to be improved.

Crystal geometry verification
In 2014, Chung et al. 86 published a Computation Ready Experimental (CoRE) MOF database that was created by rst screening the Cambridge Structural Database (CSD 87 ) to nd MOFs and then partially cleaning these structures.Their cleaning process aimed to remove solvent molecules and other small adsorbates from the MOF's pores, keep charge-balancing ions, and x or eliminate any structures with disordered atoms and partial occupancies.Also, some of the structures had missing hydrogen atoms added to them.][90] CoRE MOF 2019 is an updated version of the database containing thousands more structures than the 2014 version. 91tructures with only free solvent molecules removed are found in the free solvent removed (FSR) set.Structures in the all solvent removed (ASR) set have both bound and free solvent molecules eliminated.These modied structures are designated as the FSR-public and ASR-public datasets. 91Chung et al. reported the original CSD refcode as the relevant structure in instances when the FSR or ASR processes did not result in any molecules being removed or any other modications to the structure; these are designated as the FSR_CSD and ASR_CSD datasets. 91n 2017, Moghadam et al. 92 constructed a CSD MOF subset using seven "look-for-MOF" criteria to locate and extract MOF materials from the CSD database.Moreover, a variety of computational techniques were developed and employed to rst exclude the solvent molecules from the CSD MOF subset and create a CSD non-disordered MOF subset. 92o identify structures with isolated or mis-bonded atoms, Chen and Manz 93 screened the 2019 CoRE MOF database for the following: (i) atoms not directly bonded to any neighboring atoms (aka, 'isolated' atoms), (ii) atoms that are too close together (aka, overlapping atoms), (iii) misplaced hydrogen atoms, (iv) under-bonded carbon atoms (this could result from missing hydrogen atoms) and (v) over-bonded carbon atoms.MOFs that passed this screening procedure were placed into accepted_FSR (for free solvent removed) and/or accepted_ASR (for all solvent removed) subsets of the CoRE MOF database.
In 2021, Daglar et al. 94 showed that a considerable number of MOFs are reported with the same refcode but different reduced chemical compositions in the 2019 CoRE MOF database versus the CSD non-disordered MOF subset.They claimed that 2434 MOFs had the same reduced chemical formula in both databases; these are known as chemical formula matched-MOFs (CFM-MOFs). 941109 MOFs had different reduced chemical formulas in one database compared to the other database; these are known as chemical formula unmatched-MOFs (CFU-MOFs). 94They demonstrated how the database used affects the simulation results of 1109 CFU-MOFs by yielding signicantly different gas uptakes. 94n 2021, Rosen et al. 95,96 used a high-throughput periodic DFT methodology using the PBE-D3(BJ) data initially to create the QMOF database of quantum-chemical characteristics for MOFs.They accounted for the list of materials classied as MOFs from the 2019 CoRE MOF database as well as the CSD non-disordered MOF subset.They rst ltered problematic MOFs that had missing H atoms, fractional occupancies, missing framework atoms, lone (i.e., unbonded) atoms, overlapping atoms, an insufficient number of charge-balancing ions, and other structural problems. 95,96Aerwards, they performed DFT calculations on MOFs that passed this screening process.The QMOF database currently includes more than 20 000 experimentally synthesized MOFs with publicly available parameters determined by DFT such as optimized geometries, density of states, and DDEC6 population analysis results (e.g., net atomic charges, [58][59][60] atomic spin moments, 58,97 and bond orders 58,98 ). 95,96aken together, the above studies cast some doubts on the quality of available databases containing partly cleaned experimentally-derived MOF structures.What happens if a particular MOF has different chemical structures in different partly cleaned experimentally-derived MOF databases?In such a case, how does one decide which (if any) of the reported structures for the MOF is chemically reasonable?An obvious way to mitigate this issue is to use a subset of MOFs that have the same reported chemical structure in several partly cleaned experimentally-derived MOF databases.Because these various databases used different cleaning procedures, a MOF that has exactly the same 'cleaned' chemical structure in several of these databases is more likely to be trustworthy.For example, a MOF missing hydrogen atom(s) might pass through one database's cleaning procedures but be rejected by a different database's cleaning procedures.If one or more of the databases added in missing hydrogen atoms, their placement is suspect if two databases do not agree on the hydrogen atom positions.As another example, a particular adsorbed solvent molecule in a particular MOF might be removed by one database's cleaning procedures but not removed by a different database's cleaning procedures.These disagreeing structures will be rejected if we select a subset of MOFs that have the same chemical structure in several partly cleaned experimentally-derived MOF databases.
As shown in Fig. 4, we applied a crystal geometry verication procedure to select MOFs with reliable chemical structures.In Fig. 4, each number in parentheses is the number of MOF structures satisfying that criterion.First, we checked whether a MOF was listed in and had the same reduced chemical formula in Chen and Manz's accepted_FSR and accepted_ASR datasets.Selecting MOFs that have the same chemical structure aer free solvent removal as aer all solvent removal reduces ambiguity in the solvent removal process.Moreover, these accepted structures had passed through Chen and Manz's screening process to identify misbonded and lone atoms. 93For those MOFs in the ASR and FSR public datasets, we then checked to see if they were in Daglar et al.'s 94 CFM-MOF dataset For each MOF (whether in the public or CSD portion of the ASR and FSR databases) that passed the above screening criteria, we next checked whether it was listed in the QMOF database and had the same chemical formula in the QMOF and 2019 CoRE MOF databases.Because the QMOF database applied some different cleaning procedures than the 2019 CoRE MOF database, this screening criterion selects MOFs whose chemical structure is more robust because it passed through different cleaning procedures.Then we performed atom typing on the DFT-optimized QMOF structure and the experimentallyderived 2019 CoRE MOF structure using Chen and Manz's 57 atom-typing procedure.This criterion ensured that the MOF's structure did not drastically change during DFT geometry optimization.For example, this screening criterion rejects a MOF is that is unstable aer adsorbed solvent is removed from its pores and consequently changes bond connectivity during DFT geometry optimization.
Because magnetic MOFs present greater computational challenges to converge each DFT calculation to the correct magnetic ordering, we decided for simplicity to restrict the current study to non-magnetic MOFs.We emphasize that the SAVESTEPS protocol introduced in this manuscript applies also to magnetic materials, but it is more work since care must be exercised to ensure each quantum-chemistry calculation converges to its low-energy magnetic ordering.
We then performed a visual inspection of each MOF using a chemical visualization program.This step serves as a sanity check by ensuring the MOF's structure has been viewed by a human expert.The purpose of this step is to remove any MOFs that appear to have chemically unstable structures and/or undesirable chemical linkages.Rejection or acceptance at this visual inspection stage is subjective according to the human expert's judgement and experience.Reasons for rejecting MOFs at this step included (but was not limited to) the following.Structures containing rings of 5 to 8 atoms containing four or more nitrogen atoms within the same ring (e.g., tetrazole rings) were rejected, because these may potentially thermally decompose releasing N 2 gas.Structures containing high concentrations of N-N linkages were also rejected, because these may potentially thermally decompose releasing N 2 gas.Structures that contained free or weakly bound ions that may potentially dissociate were also rejected.ions ([SO 4 ] 2− ).Some structures containing high concentrations of Cu-C-N-Cu linkages were rejected.This visual inspection also checked for misbonded atoms (e.g., overlapping atoms, misplaced hydrogen atoms, missing hydrogen atoms, etc.), but we did not nd any misbonded atoms at this point.This indicates that the earlier screening for misbonded atoms was reliable.
Aer visual inspection, we performed DFT geometry optimization on each MOF holding the unit cell's size and shape rigid at the experimental values.Aerwards, we recalculated the MOF's atom types (using Chen and Manz's 57 procedure) and checked that these were the same as atom types extracted from the experimentally-derived 2019 CoRE MOF structure.This step rejected any MOF whose bond connectivity changed during DFT geometry optimization.
Our goal was to select at least 100 MOFs for exibility parameters optimization, so aer identifying 116 MOFs that passed all of the above criteria, we stopped searching.Likely, there are additional MOFs that would have passed all of the above criteria, but we did not continue looking for them, because our goal was already reached.

Overview and ow diagram
Previously, Chen and Manz 57 worked on the large-scale computation of atom types and forceeld precursors.In contrast to atom types based on only rst neighbors, they demonstrated that atom types based on rst and second neighbors can accurately capture the chemical environment. 57pecically, they showed that the standard deviation of calculated forceeld precursors was signicantly high for atoms with similar rst-neighbor environments but comparatively small for atoms with similar rst-and-second-neighbor environments. 57or instance, the atom type 6[1-(0),1-(0),1-(0),6-(1,1,8)] designates a central carbon atom with four rst neighbors (H, H, H, and C), where each of the rst-neighbor H atoms is not directly linked to any second neighbors and the rst-neighbor C atom is directly bonded to H, H, and O in addition to the central atom. 57e used this method to compute each atom's type in a MOF's geometry.
In this study, we wrote Python codes to rst identify all the existing bonds, angles, and dihedrals in any given MOF geometry.Some of these will be placed on an 'active list' that will be used to construct the exibility model.The lists of active bond, angle, and dihedral types and instances are generated using the protocols described in Sections 5.2, 5.3, and 5.4, respectively.Fig. 5 summarizes the workow to generate these lists of active internal coordinate types and instances.This information is essential to building a potential energy model describing a particular MOF's exibility; that is, to construct U bonded;new particular_MOF ½f RA ; ℤ A g that can be used in a exible forceeld (see eqn (1)).
Our SAVESTEPS protocol requires that the unit cell used is large enough that each atom A is directly bonded to only one image of a particular rst-neighbor atom B. This is a feature not a bug.Consider a material such as NaCl crystal that has a small primitive unit cell.If we dene the unit cell to contain only one Na and one Cl atom, then during AIMD simulations all Cl atoms will move in unison, because they are periodic images of the same reference Cl atom.Because there is no such thing as a Cl-Cl vibrational stretch when using such a small unit cell, it follows that using such a small unit cell overly restricts the atom-in-material motions.To resolve this problem, a larger unit cell must be used such that each atom A is directly bonded to only one image of a particular rst-neighbor atom B. For NaCl crystal, we could accomplish this by creating a supercell that contains many Na and Cl atoms, and then use this supercell as our periodic unit cell during all of the quantum chemistry calculations and subsequent exibility model development.This enables Cl-Cl vibrational stretches to exist and be included in the parameterized exibility model for NaCl crystal.

Generating the list of stretches to use in the forceeld
Iff the distance between two atoms was less than or equal to the sum of their atom typing radii, we classied them as directly bonded to each other. 57For each atom A in the reference unit cell, we checked for its bonds to any other atom images {(B,0,0,0)} in the reference unit cell and also for its bonds to any atom images {(B,L 1 ,L 2 ,L 3 )} in the 26 unit cells surrounding the reference unit cell.
All of these bond instances were added to the list of stretch instances.To the list of stretch instances, we also added Urey-Bradley (UB) second-neighbor stretch instances for diagonals of 4-membered rings.Fig. 6 illustrates the information stored in each stretch instance.Each stretch instance stored the two atom numbers, the unit cell translation indices for each atom, the stretch type index, whether the stretch is a bond stretch or UB stretch, and the number of times this stretch instance appears in the list.Within a stretch instance, the two atoms are ordered such that their atom types are in alphabetical order; this makes it easier for the code to lookup stretch instances of the same stretch type.
The last number (i.e., the number of times this stretch instance appears in the list) is important to avoid doublecounting when computing the potential energy (this number will be either 1 or 2).For example, a stretch instance of the form [A, (0,0,0), B, (−1,0,0), ., 2] will appear again in the list as [A, (1,0,0), B, (0,0,0), ., 2].Specically, if there are N duplicates duplicate instances of the same bond stretch instance in the list, then the factor of (1/N duplicates ) will be applied to each duplicate when computing the potential energy, so that the potential energy for this instance is counted exactly N duplicates (1/ N duplicates ) = 1 time.
Whether or not to include some translationally displaced duplicate instances in the list is a soware design choice.It is possible to remove the duplicate instances from the list, and this would avoid having to use the N duplicates variable.Whether it is easier to include or exclude the translationally displaced duplicate instances has to do with how the les are read, searched, and processed; however, the end results are not changed in any way as long as the soware code is correctly written to avoid double-counting.We found it easier to include those translationally displaced duplicate instances and then introduce a weighting factor to avoid double-counting.This applies not only to the stretch instances described in this section, but also to the dihedral instances described in Section 5.4 below.
Two stretch instances were classied into the same stretch type iff they had the same combination of atom types and their equilibrium lengths differed by less than a chosen tolerance.In this work, the rst instance of each stretch type was chosen as a reference and another instance containing the same combination of atom types was added to this same stretch type iff its equilibrium length differed by less than 1% of the equilibrium length of the rst instance (the reference) in this stretch type.We found this typing criterion that includes equilibrium value similarity is necessary to achieve good performance of the parameterized forceeld.If this criterion did not pass, the new instance was placed into a new stretch type instead of being added to the existing stretch type.As shown in Fig. 6, multiple stretch types that contain the same combination of atom types are distinguished by the 'stretch type disambiguation number'.Fig. 7 shows examples of bonds comprised of the same order of atom types but having dramatically different equilibrium bond lengths.Both the Li 3 and Na 3 molecules exhibit Jahn-Teller distortion in which one of the three bonds has a substantially different equilibrium length than the other two bonds.Because this bond has a substantially different equilibrium length, its stretch force constant has a value different from that of the other two bonds.For this reason, bonds of substantially different equilibrium lengths should be classied into different stretch types even if they are comprised of the same atom types.

Generating the list of angles to use in the forceeld
We rst constructed a list of all angle instances for which the center atom in the bond angle resides within the reference unit cell (and thus has translation indices equal to (0,0,0)).Each of the two outer atoms may reside in either the reference unit cell or one of its neighboring unit cells.Fig. 8 illustrates the information stored in each angle instance.The atom number of the center atom is listed rst.The atom numbers and translation indices of the outer atoms is then listed.Just as for the stretch instances, these two atoms are ordered such that their atom types are in alphabetical order; this makes it easier for the code to lookup angle instances of the same angle type.This is followed by the angle type index.The last number indicates whether the angle is inside a 3-membered ring, a 4-membered ring, or neither (an angle is considered to be part of a 3membered or 4-membered ring if both bonds comprising the angle are part of the ring.If only one bond is part of such a ring, the angle is not considered to be part of the ring.).Each angle instance appears exactly once in the list with no duplicates, so  there is no need to store the number of times each angle instance appears within the list.
In this work, angle instances that are part of 3-membered or 4-membered rings were not used in the angle-bending potential, because those degrees of freedom were already described by the bond stretches (for 3-membered rings) or UB stretches (for 4-membered rings).However, all angle instances (including those which are part of 3-membered or 4-membered rings) were used to construct bond-bond cross terms, when the potential model included bond-bond cross terms.
Two angle instances were classied into the same angle type iff they had the same combination of bond types and their equilibrium angle values differed by less than a chosen tolerance (as explained in the previous section, two instances of the same bond type necessarily have the same combination of atom types and similar equilibrium bond lengths).In this work, the equilibrium value for each angle instance was rounded to the nearest 0.01 radians.If two angle instances had the same combination of bond types and their equilibrium angle values matched (when rounded to two decimal places), then they were placed into the same angle type; otherwise, they were placed into different angle types.As shown in Fig. 8, multiple angle types that contain the same combination and order of atom types (but have different bond types or different equilibrium angle values) are distinguished by the 'angle type disambiguation number' which labels them as 0, 1, 2, .. Fig. 7 illustrates the critical importance of including the equilibrium angle value in the angle-typing scheme.For example, all bond angles in the sulfur hexauoride (SF 6 ) molecule have the same combination and order of atom types and bond types.However, there are two dramatically different types of bond angles in this molecule: (i) 90°F-S-F angles and (ii) 180°F-S-F angles.Because these two angle types can have different force constant values, they need to be included as separate angle types in the exibility model.

Generating the list of proper dihedrals to use in the forceeld
5.4.1 Constructing the dihedral types and instances.The list of dihedral instances was generated as follows.We start with the complete list of angle instances for which the center atom in the bond angle resides within the reference unit cell, as explained in Section 5.3 above.Now, a dihedral instance can be generated by adding a bond to either end of a bond angle.For example, starting with bond angle ABC, if atom C is directly bonded to atom D, then we can generate the dihedral instance ABCD.In this example, if atom A is directly bonded to atoms E and F, then we can also generate the dihedral instances EABC and FABC.
During this process, we have to keep track of the unit cell translation indices for each atom in the dihedral instance.For example, dihedral instance A(0,0,−1)B(0,0,0)C(0,1,0)D(0,1,0) means that atom A resides inside the (0,0,−1) unit cell, atom B resides within the reference (i.e., (0,0,0)) unit cell, atom C resides within the (0,1,0) unit cell, and atom D resides within the (0,1,0) unit cell.As explained in Section 5.3 above, the center atom in each angle instance resides within the reference unit cell.By examining all bonds connecting to either end of each angle instance, we can generate the full list of dihedral instances for which at least one of the two middle atoms resides within the reference unit cell.
However, a single instance containing different translation indices can be included twice within the list of dihedral instances.For example, both A(0,0,−1)B(0,0,0)C(0,1,0)D(0,1,0) and A(0,−1,−1)B(0,−1,0)C(0,0,0)D(0,0,0) will appear within the dihedral instances list, even though they are translations of the same dihedral instance.Fig. 9 illustrates the data stored for each dihedral instance.The last number is the number of times each dihedral instance appears within the list (this number will be either 1 or 2).This number is important to avoid doublecounting when computing the potential energy.Specically, if there are N duplicates duplicate instances of the same dihedral instance in the list, then the factor of (1/N duplicates ) will be applied to each duplicate when computing the potential energy, so that the potential energy for this instance is counted exactly N duplicates (1/N duplicates ) = 1 time.
When computing the number of 'stretch instances in a stretch type' and the number of 'dihedral instances in a dihedral type', the duplicates are not double-counted.For example, a bond stretch type containing the bonds A(0,0,0) B(1,0,0), A(−1,0,0)B(0,0,0), and C(0,0,0)D(0,0,0) is said to contain two bond instances rather than three, because A(0,0,0) B(1,0,0) and A(−1,0,0)B(0,0,0) are translated images of the same bond.As shown in Fig. 10, certain kinds of dihedrals are deleted from the list of dihedral instances.A dihedral instance is deleted if it contains a 3-member ring.A dihedral instance is deleted if at least one of its contained bond angles is inside a 4member ring.These dihedral instances are deleted, because one of the underlying angles is part of a 3-member or 4-member ring and does not appear in the active list of angles that are treated by the angle-bending potential.As explained in a previous section, the internal coordinate degrees of freedom of the 3-member and 4-member rings are covered by the bond stretches and UB stretches.
The remaining entries in the dihedral instance data are described as follows.The sign of the equilibrium dihedral value was set to +1 if f eq $ 0 and to −1 if f eq < 0. Each dihedral type was assigned an index number.For each dihedral instance, the index number of the dihedral type that it belongs to was stored.Also, an entry was stored that indicated whether the dihedral instance's middle bond belonged to ring: "True" = belonged to a ring, "False" = did not belong to a ring.The algorithm we used to detect rings is described in Section 5.4.2.
Two dihedral instances were classied into the same dihedral type iff they had the same combination of angle types and the absolute values of their equilibrium dihedrals differed by less than a chosen tolerance (as explained in the previous section, two instances of the same angle type necessarily have the same combination of bond types, same combination of atom types, and similar equilibrium angle values).In this work, the equilibrium value for each dihedral instance was rounded to the nearest 0.01 radians.If two dihedral instances had the same combination of angle types and the absolute values of their equilibrium dihedrals matched (when rounded to two decimal places), then they were placed into the same dihedral type; otherwise, they were placed into different dihedral types.As shown in Fig. 9, multiple dihedral types that contain the same combination and order of atom types (but have different angle types or different absolute values of their equilibrium dihedrals) are distinguished by the 'dihedral type disambiguation number' which labels them as 0, 1, 2, .. The nal entry in the dihedral type indicates whether it is classied as 'nonrotatable',  'rotatable', 'hindered', or 'linear' according to the criteria explained in Section 5.4.4.
5.4.2Identifying whether the middle bond (of a dihedral instance) is part of a ring.If a bond is part of a ring (i.e., a bond path cycle) in a periodic crystal, then at least one ring passing through the bond contains fewer than (4N atoms + 1) atoms, where N atoms is the number of atoms in the material's periodic unit cell.This can be shown by proving the smallest (i.e., shortest) bond path cycle passing through a bond contains no more than four periodic images of the same parent atom.Fig. 11 illustrates the underlying reasons for this.The region shaded pink in Fig. 11 illustrates any case in which a bond path exists from a rst image of atom A in the reference (i.e., (0,0,0)) unit cell to a second image of atom A denoted by the unit cell translation indices (m a , m b , m c ) and a bond path also exists from the rst image of atom A to a third image of atom A denoted by the unit cell translation indices (n a , n b , n c ) which does not lie along the innite line passing through the rst and second images.The latter condition is equivalent to saying that (n a , n b , n c ) does not equal (c(m a ), c(m b ), c(m c )) for any value of c.Moreover, we choose (m a , m b , m c ) and (n a , n b , n c ) such that they are the closest images to the rst image along each of these bond paths.This is equivalent to choosing (m a , m b , m c ) such that the greatest common factor of m a , m b , and m c is one, and also choosing (n a , n b , n c ) such that the greatest common factor of n a , n b , and n c is one.For example, starting with the proposed second image (2m a , 2m b , 2m c ) we divide by the greatest common factor (2 in this case) to arrive at (m a , m b , m c ) as the actual location of the second image.Because of the periodic boundary conditions, it immediately follows that a fourth image of atom A characterized the unit cell translation indices ((m a + n a ), (m b + n b ), (m c + n c )) has: (i) a bonded path to the second image of atom A and (ii) a bonded path to the third image of atom A. Thus, it follows that a bonded path exists from the rst image of atom A to the second image of atom A to the fourth image of atom A to the third image of atom A and back to the rst image of atom A. Thus, it necessarily follows that when any bond path cycle exists that passes through a particular bond, we can nd a smallest (i.e., shortest) bond path cycle passing through that bond that such that the number of translated images of the same parent atom is no more than four.Since there are N atoms in the reference unit cell, it means the shortest bond path cycle (if any exists) passing through a particular bond contains no more than 4N atoms atoms.
As shown in blue-shaded region of Fig. 11, the connected path described above from image 1 to image 2 to image 4 to image 3 and back to image 1 of atom A is not necessarily itself a bond path cycle.Specically, the blue-shaded region shows a graphene segment.Taking the lower le image of atom A to be the (0,0,0) image, the solid red path shows a bond path to image 2 of atom A, and the solid green path shows a bond path to image 3 of atom A. The dashed red path shows a bond path connecting image 3 to image 4. The dashed green path shows a bond path connecting image 2 to image 4. Interestingly, in this case the dashed green and dashed red paths overlap; consequently, the shortest bond path cycle (which happens to be a 6-membered ring) contains 3 images of atom A rather than 4.
The yellow-shaded region in Fig. 11 illustrates a simple case (e.g., a triangle connecting atoms A, B, and C) for which the shortest bond path cycle contains a single image of atom A.
ESI Section S2 † contains a rigorously correct and complete Python function we wrote that determines which middle bonds (of the dihedral instances) are parts of rings (i.e., bond path cycles) and which are not.
5.4.3Pruning redundant dihedral types.As an example, consider the ethane molecule shown in Fig. 12.Because this molecule has a total of 9 dihedral instances but only one middle bond, this means rigid rotation of any one of these dihedral instances causes all of the other 8 dihedral instances to also rigidly rotate.If we discard 8 of these dihedral instances and keep the remaining dihedral instance to construct the exibility model, then this breaks the molecule's symmetry.To preserve the symmetry equivalency, we must therefore keep and discard the dihedral types rather than individual dihedral instances.In ethane, there are two dihedral types, and these have absolute values of dihedrals of 180°(containing 3 instances) and 60°( containing 6 instances).To construct a concise exibility model that preserves the symmetry equivalency, we can keep the dihedral type with 3 instances and discard the dihedral type with 6 instances.
We use the term 'coupled dihedral types' to mean dihedrals that share the same set of middle bond instances.The process of discarding some of the coupled dihedral types is called 'dihedral pruning'.Because all dihedral instances of the same dihedral type are either all kept or all discarded, this dihedral pruning preserves the symmetry equivalency.
Our SAVESTEPS program performs dihedral pruning using the following procedure.First, it identies which dihedral types share the same set of middle bond instances.When making this comparison, repeated values are not important.For example, the set {a,b,c,d} is considered equivalent to the set For each dihedral type, the following metric is computed: where max[q eq ABC ,q eq BCD ] is the maximum value of the two equilibrium bond angles for that dihedral type, and num_instances is the number of dihedral instances in that dihedral type.Among two or more coupled dihedral types (i.e., those sharing the same set of middle bond instances), the dihedral type having the largest value of dihedral_type_metric is retained while the others are discarded.If two or more dihedral types tie for the largest value of dihedral_type_metric, then the soware program uses a random number generator to randomly select which dihedral type (among those that tied for the largest value of dihedral_type_metric) to keep and discards the others.
Among coupled dihedral types, why is a dihedral type having the largest value of dihedral_type_metric retained while the others are discarded?This has the following simple explanation.Since the CADT potential is simpler and more computationally efficient than the ADDT potential, it would be preferable to retain a dihedral type having contained bond angles far away from linear (i.e., maximizing (180°− max [q eq ABC ,q eq BCD ])).To maximize the computational efficiency, it would also be preferable to keep the coupled dihedral type that has the smallest number of dihedral instances.The dihe-dral_type_metric (see eqn (42)) combines these two criteria into a single descriptor whose value is to be maximized.
5.4.4Classifying each dihedral type as 'nonrotatable', 'rotatable', 'hindered', or 'linear'.Each dihedral type was clas-sied as 'nonrotatable', 'rotatable', 'hindered', or 'linear'.A dihedral type was classied as 'linear' iff one of its equilibrium bond angles was close to linear; that is, if either p − q eq ABC < 3 or p − q eq BCD < 3, where 3 is a tolerance (e.g., 0.03 radians).What exactly does it mean for a dihedral type to be 'non-rotatable'?Grimme classied a bond as 'nonrotatable' if it was part of a ring or had a bond order greater than 1.3. 6According to Grimme's denition, the C]C bond in ethene (i.e., C 2 H 4 molecule) would be classied as 'nonrotatable', because its bond order equals ∼2.Our dihedral typing protocol does not use the bond orders as inputs and instead classies a dihedral type as 'non-rotatable' iff at least one dihedral instance belonging to this dihedral type has a middle bond that is part of a ring (i.e., bond path cycle).Using our denition, the C]C bond in ethene would be classied as 'rotatable' even though it has a high rotational energy barrier.
Are there situations in which the rotational barrier is small even though a dihedral instance's middle bond is part of a bond path cycle?Consider a polymer made of benzene rings where the 1,4-position carbons of each benzene ring are bonded to adjacent benzene rings to form the structure (C 6 H 4 ) n .Connecting the two ends of this polymer together forms a loop (aka 'necklace').Even though each C-C bond in this 'necklace' belongs to at least one bond path cycle, it still seems possible for each benzene ring to rotate about an axis running through its 1,4-position carbon atoms.Thus, we must offer the caveat that being part of a bond path cycle 'normally' but 'not universally' hinders rotations about a middle bond.
For simplicity, our SAVESTEPS algorithm (at least in its current form) classies a dihedral type as nonrotatable iff at least one of its dihedral instances has a middle bond that is part of a ring.If a dihedral type had some dihedral instances whose middle bond was part of a ring and other dihedral instances whose middle bond was not part of a ring, then this dihedral type was still classied as 'nonrotatable'.
Consider a dihedral type for which none of its dihedral instances had a middle bond that was part of a ring.Using a random number generator, the SAVESTEPS program randomly chose one of the dihedral instances in this type.Next, we rotated the dihedral for this instance from f = −170°to 180°i n 10°increments to produce T = 36 geometries comprising a rigid torsion scan.Next, we computed the atom types for each atom in each of these geometries and compared them to the atom types in the reference geometry (see Section 6.2 for a description of the reference geometry).If the atom types in each of the rigid torsion scan geometries matched those in the reference geometry, this means no new bonds were formed and no bonds were broken during the rigid torsion scan.In this case, the corresponding dihedral type was classied as 'rotatable'.On the other hand, if any atom type changed for any atom in any of the rigid torsion scan geometries compared to the reference geometry, then the dihedral type was classied as 'hindered'.This corresponds to the situation in which the dihedral cannot rigidly rotate through its full range without forming new bonds and/or breaking old bonds.For example, this could correspond to a situation in which one chemical group collides with another chemical group (aka 'steric collision') during part of the rigid torsion scan.During the subsequent force constants optimization, hindered and nonrotatable dihedral types are treated on the same footing and use the same forms of torsion model potentials (i.e., CADT_1 or ADDT_1).
The above analysis process was individually applied to each dihedral type for which none of its dihedral instances had a middle bond that was part of a ring.In such a way, each dihedral type having no middle bond instances that were part of a ring was classied as either 'rotatable' or 'hindered'.
Why did we classify an entire dihedral type as non-rotatable even if only some of its instances were part of a ring instead of treating the instances that were not part of a ring as rotatable?This was a pragmatic choice based on two observations.Observation # 1: if a particular dihedral instance that is not part of a ring is classied as non-rotatable, this has negligible effect on small dihedral displacements but severely restricts large dihedral displacements (e.g., Df $ p/4).If this particular dihedral instance should be rotatable, the parameterized exibility model will still correctly describe small dihedral displacements but will undercount the large dihedral displacements for this particular dihedral instance.Accordingly, the parameterized exibility model will still be functional even if not exact.Observation # 2: we carefully reviewed the entire set of 116 MOFs and found that the situation of no-ring and ring dihedral instances belonging to the same dihedral type occurred in only three (i.e., AFITEP, AMOYOR, and PORVUO) of these MOFs.We then manually examined each of these MOFs using a visualization program.We found that this situation corresponded to sprawling bonded networks that were a combination of tree-branch-like structures and spider-web-like structures.The local bonded structure of a tree-branch-like structure looked identical to that of a spider-web-like structure; however, their longrange structures differed in that the tree branches were not part of a bonded ring while the spider webs were part of bonded rings.Because the tree branches were long, they would have given rise to 'hindered' rotation and thus were not freely rotatable.'Non-rotatable' and 'hindered' dihedrals use the same dihedral model potential, so the distinction between the two does not impact the parameterized exibility model.In summary, these empirical observations support the practice of classifying a dihedral type containing nonzero numbers of both ring and no-ring dihedral instances as 'non-rotatable' for pragmatic reasons.
It is critically important to use the same rigid torsion scan geometries for the test to see if the dihedral type is 'rotatable' or 'hindered' as is subsequently used for the single-point quantum-chemistry calculations to form the rotatable dihedrals training dataset.Specically, if the dihedral type is clas-sied as 'rotatable', then this process has veried that the bond connectivity graph is unchanged during the rigid torsion scan.This is an important pre-requisite for the single-point energy calculations that formed the rotatable dihedrals training dataset, as described in Sections 6.4 and 7.1-7.4.

Common settings
5][106][107][108] The project or augmented-wave (PAW) method 109,110 was used.The PAW method is a frozen-core allelectron calculation.An energy convergence criterion of 10 −6 eV was used for the self-consistent eld (SCF) cycles.The number of kpoints was set so that for each lattice vector the length times the number of k-points was greater than 16 Å.The planewave energy cut-off was set to 400 eV.A Prec = Accurate grid with Addgrid = False was used to avoid wrap-around (aka aliasing) errors.These settings were shown in previous work to give accurate results. 58

Geometry optimization
DFT-with-dispersion geometry optimization was performed allowing the atomic positions to relax with the unit cell volume and shape held xed at the experimental values taken from the 2019 CoRE MOF 91 dataset.The convergence criterion was that each force component (e.g., F x , F y , F z ) was smaller in magnitude than 0.01 eV Å −1 for each atom.This constitutes the 'reference geometry' for which all atom-in-material forces in the subsequently parameterized exibility model will be zero.
We originally applied an earlier variant of this protocol to DFT-optimized structures we computed that fully relaxed both the atom-in-material positions and the unit cell's size and shape.Upon further investigation, we came to believe that the DFT-optimized lattice vectors (which determine the unit cell's size and shape) were less accurate than the experimentally measured lattice vectors for these materials.Technically speaking, quantum chemistry calculations do not generate rigorously correct optimized lattice vectors because of the Pulay stresses due to basis set incompleteness. 111While using an extremely large basis set with a ne k-point mesh can mitigate this issue, [111][112][113] quantum chemistry calculations near the complete basis set limit are computationally expensive.For these reasons, we believe it is usually preferable to construct the reference geometry by allowing the atomic positions to relax during geometry optimization with the unit cell volume and shape held xed at the experimental values.Accordingly, all computational results presented in this article were obtained using the experimental lattice vectors.
Using the experimental lattice vectors involves three caveats.First, some materials have not been characterized experimentally yet.For these new materials, experimentally-measured lattice vectors are not available, and one should instead use the quantummechanically-computed lattice vectors (this case did not arise for any materials in this study).Second, as pointed out by one of the anonymous reviewers of this article: "experimental characterization of MOFs oen takes place on solvated structures, so the experimental values do not always pertain to the more relevant empty/activated structures".Third, experimental characterization oen takes place at room temperature while the electronic groundstate structure should technically correspond to a temperature of absolute zero.In spite of these three caveats, it is still true that oen the experimentally-measured lattice vectors have smaller uncertainties and smaller errors than their quantummechanically-computed counterparts.The important principle is to use whichever lattice vectors are more accurate: the experimental ones or the quantum-mechanically-computed ones.

Ab initio molecular dynamics and nite-displacement 'Hessian' calculations
To achieve a comprehensive sampling of all internal motion modes, we employed a combination of ab initio molecular dynamics (AIMD) and nite-displacement Hessian calculations.AIMD simulations provide information about larger random displacements, while Hessian calculations systematically sample every degree of freedom using small nite displacements.By combining these techniques, we achieved a more rigorous sampling that includes both small nite displacements of every mode as well as some larger displacements of randomly selected modes.Ten AIMD runs were performed for each structure to generate training set data.Another 10 AIMD runs per structure were performed to generate validation set data.The forces and geometries were extracted and assembled into a csv le.Then, these csv les were read into a python program that generates the exibility parameters through leastsquares regression as described in Section 7 below.
For the AIMD calculations, the number of geometry steps per run was set to 100 starting from the optimized geometry.The forces were calculated as a response to the changes in atom positions while keeping the cell shape and volume constant.A time step of 1 femtosecond was used with a starting temperature of 300 K and a microcanonical (NVE) ensemble.This corresponded to the following VASP settings: IBRION = 0 (chooses molecular dynamics), NSW = 100 (chooses 100 geometry steps), ISIF = 0 (chooses xed cell volume and shape while atoms move), MDALGO = 0 and SMASS = −3 (chooses NVE ensemble), POTIM = 1 (chooses 1 femtosecond time step), TEBEG = 300 (chooses starting temperature).
The Hessian matrix was computed using a nite difference method with a displacement size of 0.07 Å and four displacements per direction.Specically, the atomic positions were displaced by −0.14 Å, −0.07 Å, 0.07 Å, and 0.14 Å along each of the x, y, and z axes, resulting in a total of 12 displacements per atom.This corresponded to the following VASP settings: NSW = 1, ISIF = 0, IBRION = 5, POTIM = 0.07, NFREE = 4.

Rotatable dihedral single-point calculations
To explore the potential energy associated with the rotation of certain dihedrals, single-point (i.e., rigid geometry) calculations were carried out using the common VASP settings (Section 6.1) as follows.Within each rotatable dihedral type, a single dihedral was randomly selected and rotated in 10°increments from a dihedral value of −170°to 180°.The procedure resulted in 36 different geometries for the selected dihedral.The process was repeated for each rotatable dihedral type in the MOF.This generated energy versus dihedral value curves that were analyzed as described in Section 7 below (As shown in ESI Section S6, † we performed a test in which torsion scan curves were generated for every instance of a randomly chosen rotatable dihedral type.All of those torsion scan curves were identical.More generally, if two instances of the same type have different signs for f eq , then one would have torsion scan curves for these two instances that are mirror images of each other; since this case is automatically handled by Manz's torsion model potentials, it does not require generating separate torsion scan curves for these two instances.This validates the method of generating a torsion scan curve for one instance of each rotatable dihedral type.).
We prepared these rigid torsion scan geometries using Rodrigues' rotation formula: where û is the axis of rotation, q rot is the angle of rotation, w is the vector before rotation, and wrotated is the vector aer rotation.The rotation angle was set as where f desired is the desired value of f ABCD and f eq is the value of f ABCD in the reference geometry (i.e., the quantummechanically-optimized ground-state geometry using the experimental lattice vectors without any dihedral constraints).
Let RA be the position of atom A in the (0,0,0) unit cell.Let (A,L A 1 ,L A 2 ,L A 3 ) denote a translated atom A image whose position is whereṽ ð1Þ ,ṽ ð2Þ , andṽ ð3Þ are the unit cell's lattice vectors.Analogous formulas hold for all other atom images.For example, ṽð3Þ is the position of atom image (B,L B 1 ,L B 2 ,L B 3 ).For a rotatable dihedral ABCD in the extended structure (bonded_group_A)BC(bonded_group_D), we computed whether the bonded_group_A emanating from atom A was smaller than the bonded_group_D emanating from atom D. If bonded_-group_A contained fewer atoms than bonded_group_D, then atom image B was chosen as the origin (i.e., start !¼ G ! B ) for the rotation; otherwise, atom image C (i.e., start !¼ G ! C ) was chosen as the origin for the rotation.The axis of rotation, û, is the unit vector parallel to the middle bond BC ! and pointing towards the chosen origin; that is, pointing along the direction from C to B if bonded_group_A contained fewer atoms than bonded_group_D, otherwise pointing along the direction from B to C. w is the vector from the chosen origin to a particular atom image E in the bonded group being rotated, as computed in the reference geometry: If bonded_group_A contained fewer atoms than bonded_-group_D, then bonded_group_A is the group being rotated; otherwise, bonded_group_D is the group being rotated.The position of atom image E aer the dihedral rotation is By converting G ! rotated E to fractional coordinates and then converting the decimal part of the fractional coordinates back to real space, the position Rrotated E of the rotated atom E within the (0,0,0) unit cell can be computed from G ! rotated E .This process was repeated for each and every atom in the bonded group being rotated to nd their new positions; the positions of all other atoms were the same as in the reference geometry.
7. Least-squares fitting to extract the flexibility parameters

Smart selection of rotatable dihedral potential modes
As described in Section 6.4 above, we quantum-mechanicallycomputed energies, E QM RTS [f], along a rigid torsion scan (RTS) for one rotatable dihedral instance in each rotatable dihedral type.Along this curve for a particular dihedral instance, the geometries must be equally spaced in dihedral value over the range (−p,p].For example, we used T = 36 geometries with equally spaced dihedral values f= −170°, −160°, .,170°,180°.Along this curve for a particular dihedral instance, the average energy is and the self-overlap integral is: As described in the companion article, the rst seven independent torsion modes have the following orthogonal basis functions when the average potential of each torsion mode has been shied to zero: 50 This allows E QM RTS [f] be approximated by the following model where the coefficients for each mode are dened as 50 The goodness of t (aka R-squared value) for this RTS model equals the sum of squared coefficients where the sum is performed over all modes included in the model. 50n this work, we considered mode m to be active iff abs[c m ] > 0.1; in this case, we included mode m in the subsequent exible forceeld model.If abs[c m ] # 0.1, the mode was considered inactive and not included in the subsequent exible forceeld model.We call this process 'smart selection of rotatable dihedral potential modes'.Consequently, the torsion potential for a rotatable dihedral type can be represented as a linear combination of one to seven modes.The goodness of t (aka Rsquared value) for this 'smart-selected' RTS model equals the sum of squared coefficients: where the sum is performed over the selected modes included in the model.Fig. 13 plots examples of rigid torsion scans for rotatable dihedrals in selected MOFs.The top panels display the results for MOFs with only one rotatable dihedral type, while the bottom panels display the results for one MOF with two rotatable dihedral types.The R-squared values close to one show the models performed well.

Linear equations for exibility parameters
Our linear regression problem contains two kinds of observation variables in the combined training dataset: (a) quantummechanically-computed atom-in-material forces and (b) quantum-mechanically-computed total energies.The quantummechanically-computed atom-in-material forces are from the QM-optimized geometry, AIMD geometries, and nitedisplacement 'Hessian' geometries.There are a total of f_rows = 3N atoms N force_geoms (56)   force components in the forces training dataset.
The predictor variables also contain two sets of data: one for forces tting and the other for the rotatable dihedrals tting.Let M be a matrix containing values of the predictor variables.This leads to the following linear model: where i is the observation index and j is the model parameter index.In our case, each model parameter is a force constant for a exibility term: The total number of attempted force constants in the model is p.Here, the term 'total number of attempted force constants' refers to the number of exibility terms (i.e., number of force Paper RSC Advances constants) that were 'attempted' before any of these were zeroed out by the bounds or regularization constraints.
Because the atom-in-material forces for the AIMD-generated geometries depend on all of the force constants while the RTS energies depend only on the rotatable dihedral force constants, it follows that the predictor variables matrix M has the form where N AFCs is the total number of attempted force constants.N rd_AFCs is the number of rotatable dihedral attempted force constants.
To dene the M1, we need to start from the following relation: Comparing eqn ( 58)-( 63) reveals that where m is the geometry number.x= 1, 2, 3 for x, y, z force components of atom g, respectively.{a eq h } are the equilibrium values of the internal coordinates.The derivatives in eqn ( 64) can be computed either numerically (using nite difference approximation) or analytically.
For the predictor variables in rotatable dihedrals tting, as discussed in Sections 2.2 and 7.1, we can have up to seven active modes for each rotatable dihedral.We rst determine which dihedral modes are active for each rotatable dihedral type using the method described in Section 7.1.Since the forces are zero at the equilibrium geometry, the no-intercept linear regression model is used.Therefore, to be able to use a no-intercept model, we centered the observation variable (i.e., the QM-computed energy) for rotatable dihedral torsions by subtracting the average value as described in ESI Section S3 † to construct the matrix M2.Please see ESI Section S3 † for a more detailed description of linear equations for exibility parameters.To assess the predictive power of our exibility model, we employed two metrics: the R-squared (aka 'goodness of t') and the root-mean-squared error (RMSE).R-Squared is calculated using the formula: Fig. 13 Potential energy curves for rigid torsion scans of rotatable dihedrals.In each panel, the black dots show the quantum mechanical energy obtained from single-point DFT_with_dispersion calculations.The orange curve illustrates the fitted model using all 7 modes, while the green curve shows the fitted model using only the smart-selected modes.
The sum of squared errors (SSE) and sum of squares total (SST) for rotatable dihedrals tting and forces tting are dened as follows: is the ith predicted force component.RMSE rot_dihedrals and RMSE forces are computed as follows: Applying eqn ( 65), ( 70), (71), and ( 73 Whether or not the MOF has rotatable dihedrals, the validation dataset contains quantum-mechanically-computed forces for the optimized ground-state geometry plus new AIMD-generated geometries (we included the material's optimized ground-state reference geometry in both the training and validation datasets in order to validate that the trained forceeld yields zero-valued atom-in-material forces for this optimized geometry).The AIMD-generated geometries for the validation dataset are taken from separate AIMD runs than those used to prepare the forces training dataset.When computing statistics for the validation dataset, the b matrix (i.e., set of force constants values) applied is the one that was previously optimized to the combined training dataset.For the validation dataset, the model-predicted forces follow eqn (69), where M1 is constructed by applying eqn (64) to the validation dataset geometries.Applying eqn ( 65), ( 70), (71), and ( 73) to the forces validation dataset yields SSE validation , SST validation , R validation 2 , and RMSE validation .When analyzing the validation dataset, f_rows = 3N atoms N validation_geoms is the number of force components in the validation dataset, where N validation_geoms is the number of geometries in the validation dataset.
7.4 Embedded feature selection using the LASSO method with bounds on some force constants A common issue in tting parameters using ordinary least squares regression is multicollinearity.When two or more predictors in the linear model are highly correlated to each other or not linearly independent, this is known as multicollinearity. 114In this case, the Gram matrix M T M contains one or more singular values that are zero or close to zero. 114In this case, there are more predictors than needed to build the model.Embedded feature selection is needed to select an appropriate subset of predictors for model building.The LASSO method solves these two problems (i.e., multicollinearity and embedded feature selection) by adding a L 1norm penalty term to the least-squares loss function. 65,66pecically, the LASSO method zeroes out coefficients of unnecessary predictors; this reduces the number of predictors required to build a useful model. 65,66,114Alternatively, ridge regression 115 solves the multicollinearity problem by adding a L 2 -norm penalty term to the least-squares loss function; however, ridge regression does not solve the embedded feature selection problem.Accordingly, we decided to use the LASSO method rather than ridge regression to optimize the force constants for exibility interactions.We used the Python version of the glmnet package 116 which minimizes the following loss function: Here, i is the observation index and j is the predictor variable index.For LASSO regression, a = 1.In contrast, a = 0 for ridge regression.We used LASSO regression.N is the number of observations (i.e., the number of rows in matrix M) and p is the number of predictor variables (i.e., the number of columns in matrix M). l $ 0 is the regularization parameter.u i is the observation weight.n j is the jth variable's penalty factor.h j is the Lagrange multiplier to enforce bounds on the model parameter b j .The optimized model parameter values, {b j }, minimize the loss function's value: If the MOF contains any rotatable dihedrals, we dened the training dataset observation weights as follows.
(By convention, Glmnet_Python rescales the observation weights so that they sum to N. 117 This does not change their ratios, the optimized {b j }, the R-squared values, or the RMSE (as dened in eqn ( 72) and ( 73)).)SSE forces (eqn ( 70)) and SSE rot_dihedrals (eqn ( 67)) can be rewritten as follows: By dening u i as described in eqn ( 76), setting a = 1, and by substituting eqn ( 77) and ( 78) together with the R-squared denition (eqn ( 65)) into ( 74), the loss function can be rewritten as follows: where We dened the jth predictor variable's penalty factor as follows: (By convention, Glmnet_Python rescales the penalty factors so that they sum to p. 117 This does not change their ratios, the optimized {b j }, the R-squared values, or the RMSE (as dened in eqn ( 72) and ( 73)).)We chose this denition, because it makes the model invariant to the choice of measurement units.For example, whether the distance between two atoms is measured in either Angstroms (Å) or bohrs, the optimized model still gives the same optimized {b j }, R-squared values, and RMSE (as dened in eqn ( 72) and ( 73)).Proof: (1) M ij b j has the same units as Y QM i .
(3) Thus it follows that is dimensionless.( 4) Since jb j j has the same units as b j , it follows that the penalty factors dened by eqn (81) give optimized R-squared values (i.e., R forces_training 2 , R rot_dihedrals 2 , and ) that are independent of the measurement units.
lb j and ub j provide a lower bound and an upper bound on the model parameter b j : We used no upper bound (i.e., ub j / innity).We used the lower bound of zero to constrain all bond stretches, UB stretches, angle bends, non-rotatable/hindered dihedral torsions, and ADDT linear torsion modes to be non-negative.This guarantees that displacements along those internal coordinates away from the equilibrium (aka optimized) structure leads to energy increases in the model forceeld.For rotatable dihedrals, the lower bound was set to zero iff only one mode was smart selected, because a single rotatable dihedral mode cannot exhibit competing signs.When a rotatable dihedral has more than one mode that is smart selected, no lower bound on the corresponding force constants was imposed (i.e., lb j was set to minus innity) because some modes having a negative force constant could be compensated by other modes having a positive force constant to still produce an energy increase when the geometry is displaced away from the equilibrium structure.The bond-bond cross terms determine the relative energy of symmetric stretches compared to asymmetric stretches.The lb j was set to minus innity for the bond-bond cross terms, because depending on the situation symmetric stretches may be either higher or lower in energy than asymmetric stretches.By default, the glmnet package assigns and uses a series of 100 distinct l values in a geometric progression from l min to l max . 116As an input to the glmnet function, the user species the desired ratio of l min /l max . 116We used the ratio 10 −5 .Glmnet automatically determines the l max value, which is the smallest value of l that sets all force constants to zero. 116If the smallest value of l is too close to zero, then it will not sufficiently regularize the multicollinearity problem. 65,66Therefore, we must use a l min that is small but not too close to zero. 65,66ext, we tried to nd the best l among the generated l sequence to have our nal linear model parameters.Each l will give us a set of model parameters and by increasing the l value we will have lower number of non-zero parameters and smaller R-squared.To generate a selection criterion, we rst need to estimate the number of degrees of freedom in the physical system.For a material containing N atoms in the reference unit cell in the presence of optional externally applied elds, there are 3N atoms degrees of freedom in the atomic positions.In the absence of externally applied elds, this may be reduced by 0 to 5 degrees of freedom due to center-of-mass translational symmetry and/or rotational symmetry about the center of mass.However, the precise reduction in degrees of freedom depends on whether the system is periodic or nonperiodic, whether the unit cell parameters and symmetry can be changed, and whether a molecule is linear or nonlinear or monoatomic.For simplicity, we neglect this degrees of freedom reduction (due to the absence of externally applied elds) and simply use the 3N atoms degrees of freedom.
A force constant should be kept in the exible forceeld iff excluding it would increase the SSE by more than half the formal amount per degree of freedom.Therefore, we used the following test to identify l best : N k_remaining is the number of non-zero parameters in the model (the change in SSE formally corresponding to one degree of freedom corresponds to the le-hand side of eqn ( 84) being equal to one).Substituting eqn ( 65) into (84) gives: Note that SST does not depend on N k_remaining .Eqn (85) was evaluated using nite difference approximation: Therefore, we started with the smallest l in the l sequence, which also corresponds to the largest R-squared with highest number of non-zero parameters.As mentioned earlier, as l increases, the R-squared value tends to decrease, while the number of non-zero parameters may remain the same.If we have the same number of non-zero parameters for different l values, we choose the smallest l among these that yields the highest R-squared.Next, we compare the results obtained with our selected l value with the next higher l value, which has a different (lower) number of non-zero parameters.We also ensure that the second chosen l value generates the highest Rsquared among l values having the same number of non-zero parameters.Therefore, we proceed with our test until we reach a step where the condition dened in eqn (86) is no longer satised.If l a represents the smaller l and l b is the l value that violates the test condition, we identify l a as our l best .Then, using l best , we generate our linear model parameters (i.e., the optimized force constant values).ESI Section S4 † contains the python function we employed to identify l best .
In the glmnet package, 116 we also specied the following options: family = 'Gaussian' (this corresponds to linear leastsquares tting), thresh = 10 −10 (convergence threshold for updating model parameters in each optimization iteration), standardize = False (logical ag for independent variables standardization), standardize_resp = False (logical ag for response variables standardization), intr = False (logical ag to add or remove intercept from linear model; assigning "False" to this parameter means we are using a no-intercept linear model).

Classifying the MOFs into four quadrants
As explained in Sections 5.4.2 and 5.4.4,we classied each dihedral as non-rotatable, rotatable, hindered, or linear.Since all MOFs contain at least one dihedral, each MOF can be clas-sied into a quadrant depending on whether it contains any rotatable dihedrals and/or any hindered dihedrals.Quadrant 1 includes MOFs having no rotatable dihedrals and no hindered dihedrals; each MOF in quadrant 1 contains only non-rotatable and/or linear dihedrals.Each MOF in quadrant 2 contains at least one hindered dihedral, no rotatable dihedrals, and any number of non-rotatable and/or linear dihedrals.Each MOF in quadrant 3 contains at least one rotatable dihedral, no hindered dihedrals, and any number of non-rotatable and/or linear dihedrals.Each MOF in quadrant 4 contains at least one hindered dihedral and at least one rotatable dihedral and any number of non-rotatable and/or linear dihedrals.
A dataset comprising 116 MOFs successfully passed the crystal geometry verication procedure outlined in Section 4. As described in Section 6, we performed quantum chemistry calculations on these MOFs.No MOFs were excluded from the dataset during or aer exibility parameters optimization.For this dataset, Table 1 lists the number of MOFs in each quadrant.

MOF sizes and chemical element compositions
Fig. 14 shows the distribution of unit cell sizes as quantied by the number of atoms per unit cell.The most prevalent range was 100-199 atoms per unit cell.The largest and smallest MOFs in our investigation contained 496 and 38 atoms per unit cell, respectively.
We identied a total of 23 distinct chemical elements present within these structures.Fig. 15 shows the frequency of occurrence for each of these 23 elements across the 116 MOFs.Every MOF within the dataset had both carbon (C) and hydrogen (H) atoms.Paper RSC Advances

Bond, angle, and dihedral types
In this section, all of the plotted data corresponds to the nal internal coordinates list that follows all adjustments, such as removal of angles in 3-and 4-membered rings and dihedral pruning.Moreover, all MOFs in the relevant quadrants were included in these plots.Fig. 16 shows stacked histograms of the number of MOFs containing various numbers of active internal coordinate types (le panels) and active internal coordinate instances (right panels).The number of active angle bend types was usually larger than the number of active bond plus UB stretch types.Aer dihedral pruning, the number of active angle bend types was usually larger than the number of active dihedral torsion types.Overall, the numbers of internal coordinate instances per MOF were much larger than the numbers of internal coordinate types per MOF.This clearly demonstrates the effectiveness of grouping similar instances into the same type to reduce the number of force constant values that have to be optimized.
Section S7 of the ESI † contains histograms showing the distribution of number of internal coordinate instances per internal coordinate type.The total number of stretch, angle, and dihedral types was 2327, 6358, and 3740, respectively.These stretches included both the bond stretches and the UB stretches.These distributions peaked at 6-8 instances per stretch type, 4-5 instances per angle type, and 4-5 instances per dihedral type.
The histogram in the le panel of Fig. 17 shows the distribution of rotatable dihedral types per MOF aer dihedral pruning.Of the 116 MOFs we studied, 79 had no rotatable dihedrals corresponding to MOFs in quadrants 1 and 2. The other 37 MOFs had at least one rotatable dihedral and represent quadrants 3 and 4. Each of the 37 MOFs had fewer than 20 rotatable dihedral types aer pruning, with most MOFs in these quadrants having between 1 and 9 rotatable dihedral types.The histogram in the right panel of Fig. 17 shows the distribution of rotatable dihedral instances per MOF aer dihedral pruning.The 37 MOFs in quadrants 3 and 4 had fewer than 80 rotatable dihedral instances aer pruning, with most MOFs in these quadrants having between 1 and 19 rotatable dihedral instances.
Table 2 lists the number of dihedral instances, number of dihedral types, and the number of MOFs that used ve different kinds of dihedral torsion model potentials: CADT nonrotatable/hindered, ADDT non-rotatable/hindered, CADT rotatable, ADDT rotatable, and ADDT linear.These model potentials are explicitly dened in Section 2.2 above.Examining Table 2, the vast majority of active dihedrals used the CADT non-rotatable/hindered model potential.This is the simplest and computationally cheapest among the ve model potentials.The ADDT rotatable model potential is the most computationally expensive among the ve model potentials, and it was used the least oen.Moreover, the CADT rotatable and ADDT rotatable model potentials are used along with smart selection to ensure the computational cost is minimized by including only important torsion modes.Overall, this strategy provides an extremely general, concise, and cost-effective approach.The following sections show this strategy is also extremely accurate.

Internal coordinate redundancy
The number of degrees of freedom for atom-in-material motions in a crystal having xed unit cell size and shape can be derived as follows.Each atom can be moved along x, y, and z directions; this gives 3N atoms motions.In the absence of externally applied elds, the total potential energy is unchanged by uniform translation so this means 3 motions do not affect the material's potential energy function.Hence there are (3N atoms − 3) independent degrees of freedom in the material's internal coordinates when keeping the unit cell's size and shape rigid.
The internal coordinate redundancy (ICR) is therefore dened as Fig. 15 A stacked histogram showing the number of MOFs containing various chemical elements (If a particular MOF had more than one atom of a particular chemical element, this counted only once.For example, a MOF with six Zn atoms counts as 1 towards the Zn bin.).Elements not shown in this graph were not contained in any of these 116 MOFs.
In eqn (87), num_active_internal_coords is the number of active internal coordinates instances; that is, the number of internal coordinates instances that are used to construct any interactions in the exibility model.If the ICR is negative, this means the active internal coordinates list contains fewer instances than there are degrees of motion freedom.If the ICR is positive, this means the active internal coordinates list contains more instances than there are degrees of motion freedom.What is the 'best' ICR value?At rst, we may think that zero ICR is the 'best' value, because it means there are exactly the same number of instances in the active internal coordinates list as there are degrees of motion freedom; however, this means the exibility model does not self-correct for any over-simplications in the model potentials.If ICR is greater than zero, then this provides some malleability for the model to partly self-correct for any oversimplications in the model potentials.However, too much redundancy is also a disadvantage because it means the exibility model contains a relatively large number of interaction terms, and this leads to relatively high computational costs when using the model.Therefore, the 'best' ICR value is a modest positive percentage (e.g., ∼20-60%) that provides some malleability for the exibility model to partly self-correct for any oversimplications in the model potentials while still keeping the computational costs relatively low.Fig. 18 shows a stacked histogram of ICR for all 116 MOFs aer dihedral pruning.When computing these values, the active list of internal coordinates instances included the bond stretches, the angles not in 3-membered or 4-membered rings, UB stretches composed from the diagonals of 4-membered rings, and the dihedrals active aer pruning.Examining Fig. 18, 30-39% redundancy was the most popular interval.When applying our protocol, the ICR was less than zero for none of these 116 MOFs.Fig. 18 shows that our protocol yielded 20-69% redundancy for the vast majority of systems investigated.Our protocol yielded ICR larger than 100% for only 2 of the 116 MOFs, and ICR values of 0-19% for only 2 of the 116 MOFs.Overall, this shows our protocol worked well.

Smart mode selection for rotatable dihedrals
All results in this section are for calculations following dihedral pruning.The le panel of Fig. 19 is a histogram showing the number of smart-selected torsion modes in each rotatable dihedral type.For ∼60% of the rotatable dihedral types, smart selection yielded a model potential containing one torsion mode per rotatable dihedral type.For example, molecular symmetry reveals the torsion potential in ethane is closely described by the single mode G CADT mode_3 [f ABCD ], and torsion modal analysis conrms this. 50If ethane's rotatable dihedral was a rotatable dihedral type in a MOF, it would be plotted in the bar labeled '1' in the le panel of Fig. 19, because only a single mode is required to describe this torsion potential.Smaller percentages of rotatable dihedral types required two (∼22%), three (∼9%), four (∼6%), ve (∼1%), six (∼1%), or seven (∼0%) torsion modes per rotatable dihedral type.
The right panel of Fig. 19 is a histogram showing which rotatable dihedral modes were smart selected.Mode 3 was the most popular mode, and it appeared in the smart-selected torsion potential for 73 (∼74%) of the rotatable dihedral types.Mode 2 was the second-most popular followed by mode 1 as the third-most popular torsion mode for rotatable dihedral types.Modes 5, 6, and 7 were less popular but appeared in the smart-selected torsion potential for signicant numbers of rotatable dihedral types.Mode 4 was the least popular and was not signicant in any of the 116 MOFs analyzed in this work.
Notably, the torsion sine modes (i.e., modes 5, 6, and/or 7) cannot be the only smart-selected modes for a rotatable dihedral type.This follows directly from the observation that the torsion sine modes are odd functions of (f − f eq ); these modes increase the potential energy for a small displacement of f in  Fig. 18 Histogram showing the internal coordinate redundancy after applying our protocol to 116 MOFs.The plotted data corresponds to the final internal coordinates list that follows all adjustments, such as removal of angles in 3-and 4-membered rings and dihedral pruning.
one direction away from f eq while decreasing the potential energy for a small displacement in the opposite direction.In contrast, the torsion cosine modes (i.e., modes 1 to 4) are even functions of (f − f eq ); these modes increase the potential energy for small displacements of f in either direction away from f eq .Since f = f eq is a local energy minimum, it directly follows that the smart-selected torsion modes for a rotatable dihedral type must include at least one torsion cosine mode.
8.6 Regularized linear least squares tting results 8.6.1 Comparing results before to aer dihedral pruning.This section contains several plots comparing performance before dihedral pruning to performance aer dihedral pruning for all of the MOFs in quadrant 1.As shown in Fig. 20, dihedral pruning eliminated approximately two-thirds of the dihedral instances leaving the other one-third aer pruning.Except for a couple of outliers, more than half of the dihedral instances for each MOF were consistently eliminated via dihedral pruning.
As shown in Fig. 21, dihedral pruning consistently reduced the internal coordinate redundancy percentage.Before dihedral pruning, the internal coordinate redundancy was >100% for most (but not all) MOFs.Aer dihedral pruning, the internal coordinate redundancy was 20-69% for most (but not all) MOFs.
In this manuscript, we report separate least-squares regression results using individual equilibrium values and average equilibrium values.Using 'individual equilibrium values' means that each instance of each active internal coordinate uses exibility terms employing the specic equilibrium value for that particular instance as taken from the material's quantum-mechanically-computed ground-state geometry (as stated previously, we held the unit cell's size and shape xed at the experimental values).Using 'average equilibrium values' means the bond lengths, angle values, or dihedral absolute values were averaged over all instances belonging to each active internal coordinate type.This yielded an 'average equilibrium value' for each internal coordinate type that was subsequently used as the equilibrium value in all exibility terms involving that internal coordinate type.Why do we report separate results using the 'individual equilibrium values' and the 'average equilibrium values' instead of choosing only one?In our experience, computing both is extremely valuable for diagnostic purposes.Consider a scenario in which R-squared values for regression using individual equilibrium values are much higher than R-squared values for regression using average equilibrium values.This scenario could indicate a situation in which an internal coordinate type was dened too loosely and included instances that differ too much from each other.
Fig. 22 shows a stacked histogram of the difference between the validation dataset R-squared for force constants regression performed using internal coordinate lists without (aka 'before') or with (aka 'aer') dihedral pruning (all R-squared and RMSE statistics for the validation datasets in this article used force constants optimized using l best values).Both distributions (i.e., using average and individual equilibrium values) peaked at a Rsquared difference of 0.025-0.03.Fig. 22 shows dihedral pruning always reduced the R-squared values by a small (i.e., 0-0.045) amount.
In Fig. 23, histograms present the distribution of number of k's before (top panels) and aer (bottom panels) dihedral pruning using average (le panels) and individual (right panels) equilibrium values for each internal coordinate type for MOFs in quadrant 1. Notably, a signicant portion (∼30%) of the k's representing non-rotatable or hindered dihedrals were eliminated by the LASSO method in the before dihedral pruning case.In the aer dihedral pruning case, the LASSO method removed a smaller portion (∼10%) of non-rotatable or hindered dihedral k's.
As shown in Fig. 24, the computational time for exibility parameters calculation (using our SAVESTEPS soware) was approximately cut in half by dihedral pruning.Overall, the results in this section showed dihedral pruning consistently and substantially reduces the number of active dihedral instances, internal coordinate redundancy, number of force constants that need to be optimized, and the computational time, while causing only a small (i.e., 0-0.045) reduction in Rsquared values.These results clearly show our dihedral pruning protocol was highly effective.
Additional aer-pruning LASSO results for MOFs without rotatable dihedrals are shown in Section S8 of the ESI.† Except Fig. 23 Histograms of force constants eliminated by the bounds or regularization constraints in the LASSO method applied to MOFs in quadrant 1.These histograms show results before dihedral pruning (top panels) and after dihedral pruning (bottom panels) using average (left panels) and individual (right panels) equilibrium values without bond-bond cross terms.for the absence of rotatable dihedrals, these results are similar to those presented in the next section for MOFs with rotatable dihedrals.
8.6.2LASSO results for MOFs with rotatable dihedrals.All results in this section are for calculations following dihedral pruning and rotatable torsion mode smart selection.Histograms showing the difference between the R-squared value for l / 0 compared to l = l best are shown in Fig. 25.This R-squared difference was almost negligible for both the rotatable dihedrals training dataset and the forces training dataset.
Fig. 26 shows the number of force constants eliminated by the LASSO method for l / 0 and the additional number that were eliminated when increasing l to l best .Some of the k's eliminated for l / 0 may have been eliminated by the nonnegative bounds placed on stretches, bends, and single-mode torsions, while others may have been eliminated due to linear dependencies between the exibility terms.The k's eliminated when increasing l to l best also correspond to unimportant exibility terms that contribute negligibly to the model's accuracy.Results are presented for calculations with and without bond-bond cross terms.
Examining Fig. 26, the exibility models containing bondbond cross terms had approximately 50% more force constants on average than the exibility models without bond-bond cross terms.A small percentage of the bond-bond cross terms were eliminated by the LASSO method.Accordingly, including bondbond cross terms increases the computational costs to use the exibility model.
All subsequent results in this section used l best .Because including bond-bond cross terms resulted in only a small improvement in R-squared value (see Fig. 27), this suggests including bond-bond cross terms is not essential to obtain useful exibility models for these MOFs.
In this article, each raincloud plot contains a box plot below the kernel density plot ('cloud').All box plots in this article have the following features.The middle line is the median.The box encloses the 2nd and 3rd quartiles.The whiskers mark the 5th and 95th percentiles.The diamonds show individual outliers.These raincloud plots also show all of the individual data points as jittered points ('raindrops').     3 summarizes training and validation statistics for MOFs in quadrants 1 and 3.All of the average R-squared values were >0.90, and all of the Rsquared standard deviations were #0.02.This clearly shows our method worked extremely well and performed consistently across different materials.
The average R-squared values for forces training and validation were slightly higher for three conditions: When using the individual equilibrium values compared to using average equilibrium values.
Without dihedral pruning compared to with dihedral pruning.
With bond-bond cross terms compared to without bondbond cross terms.However, in all three comparisons the differences in average R-squared values were small.This means either individual or average equilibrium values can be used in the exible forceeld according to the user's preference with little change in accuracy.We also conclude that dihedral pruning effectively reduced the forceeld's computational cost while causing only a small reduction in its accuracy.Finally, bond-bond cross terms do not appear to be essential in most cases.
For each calculation type, the average R-squared value for the validation dataset was approximately the same as (but not strictly equal to) the average R-squared value for the forces training dataset.This clearly shows our method does not have any over-tting problems.
The slightly higher average RMSE values for the validation dataset compared to the forces training dataset is due to the inclusion of nite-displacement 'Hessian' geometries in the forces training dataset.In each nite-displacement 'Hessian' geometry, only one atom is displaced away from its position in the optimized ground-state geometry.In contrast, all atoms are moved in AIMD-generated geometries.Consequently, the average root-mean-squared value of each force component in    The R-squared value for rotatable dihedrals training was 0.988 (average) ± 0.010 (standard deviation) irrespective of whether bond-bond cross terms were included and irrespective of whether average or individual equilibrium values were used.This high average R-squared value and small standard deviation clearly prove the exibility model consistently described the rigid torsion scan energies with extremely high accuracy.The RMSE values were small: 0.022 (average) ± 0.024 (standard deviation) eV.In this case, the standard deviation was larger than the average RMSE, because the average RMSE was relatively small in magnitude.

Performance statistics for individual atoms in a material
For further analysis, we selected two MOFs that had the lowest validation R-squared values.Among MOFs which had no rotatable dihedrals (i.e., quadrants 1 and 2), OGIBUD had the lowest validation R-squared value.Among MOFs with at least one rotatable dihedral (i.e., quadrants 3 and 4), HEBZEV had the lowest validation R-squared value.Table 4 summarizes performance statistics for these two MOFs.
To gain further insights, our SAVESTEPS Python code automatically computed and printed the atom-wise R-squared and atom-wise RMSE statistics for each atom in the material.These were computed and printed for both the forces training and validation datasets.This helps to identify whether the exibility model performed poorly for specic atoms in the material.
Raincloud plots help visualize this data.There are four scenarios: Scenario # 1: there are no small R-squared values and no large RMSE values in these raincloud plots.This means the exibility model gave small relative errors and small absolute errors when predicting atom-in-material force components for individual atoms in the material.Scenario # 2: there are some small R-squared values but no large RMSE values in these raincloud plots.This means the exibility model gave large relative errors but small absolute errors when predicting atom-in-material force components for some of the atoms having small root-mean-squared forces.
Scenario # 3: there are no small R-squared values but there are some large RMSE values in these raincloud plots.This means the exibility model gave small relative errors but large absolute errors when predicting atom-in-material force components for some of the atoms having large root-meansquared forces.
Scenario # 4: there are both small R-squared values and large RMSE values for some of the atoms in these raincloud plots.This is only a problem if a small atom-wise R-squared value and a large atom-wise RMSE value occurred for the same atom.In this case, the exibility model gave large relative errors and large absolute errors when predicting this atom's forces.
Scenarios # 1, 2, and 3 suggest the exibility model performed acceptably, because either small relative errors or small absolute errors are acceptable.On the other hand, scenario # 4 may indicate the exibility model performed poorly and needs to be improved.
What constitutes 'small' and 'large' values is a judgement call.An atom-wise R-squared value less than 0.5 could be considered 'small'.An atom-wise RMSE value could be considered relatively large if it is larger than ve times the median value.
Fig. 29 shows raincloud plots for the atom-wise R-squared and atom-wise RMSE values for the validation datasets of OGIBUD and HEBZEV.Close examination of this gure indicates Scenario # 1 when using individual and average equilibrium values for OGIBUD, and Scenario # 2 when using individual and average equilibrium values for HEBZEV.Accordingly, the exibility models for these two MOFs performed acceptably.

Computational time and memory
Computational time and memory can vary somewhat depending on the computing architecture and setup conditions.Even with this caveat, we believe it is useful to include this type of data here, because it provides some guidance for planning purposes.Potential users of our new method will likely want to know how much computing resources it could potentially require to optimize exibility parameters for large material databases containing tens of thousands of materials.
The computational times plotted in Fig. 30 include optimizing force constant values, computing statistics for the training datasets, and computing statistics for the validation datasets.These computational times do not include the times for quantum chemistry calculations to prepare the training and validation datasets.The plotted computational times are for running our SAVESTEPS Python code on a single computing core (i.e., serial computation) on the Expanse cluster at the San Diego Supercomputing Center (SDSC) (The Expanse cluster has AMD EPYC 7742 "Rome" processors.)As shown in Fig. 30, these computational times ranged from 0.17 to 32 hours.The required computational time scaled approximately quadratically (i.e., observed effective exponent between 1.70 and 1.85) as the number of atoms in the MOF's unit cell increased.Table 5 summarizes overall computational costs for: (i) quantum chemistry calculations (using VASP) to compute the training and validation datasets and (ii) exibility parameters calculation using our SAVESTEPS Python code.For testing, a MOF with rotatable dihedrals and a MOF without any rotatable dihedrals were chosen in each of ve different N atoms intervals: [1,99], [100,199] where cores j is the number of computing cores for job j and walltime j is the elapsed wall time from the start of job j to its completion.In eqn (88), the sum is over all jobs required to complete items (a) through (e) listed above.The 'peak memory' for these quantum chemistry calculations was dened as VASP printed the maximum memory used per core (i.e., max_mem_per_core j ) in the output le for each job j.In eqn (89), peak memory represents the largest memory that was used for any job to complete items (a) through (e) listed above.These VASP quantum chemistry calculations were performed on the Expanse cluster at SDSC, the Stampede2 cluster at the Texas Advanced Computing Center (TACC), and/or the Frontera cluster at TACC.We used 48 cores (a partial node) for each VASP job ran on Expanse.The Stampede2 cluster had Intel Xeon Platinum 8160 "Skylake" processors with 48 cores per node.The Frontera cluster has Intel 8280 "Cascade Lake" processors with 56 cores per node.We used one full node for each VASP job ran on Stampede2 and Frontera.For these calculations, we used the following parallelization settings in VASP: LPLANE = TRUE, NCORE = 12 (for Expanse and Stampede2) or NCORE = 14 (for   Frontera), LSCALU = FALSE, and NSIM = 4. NCORE species the number of cores in a group that work on the same orbital. 118or a job running on 48 cores, specifying NCORE = 12 yields 4 groups with 12 cores per group.Peak memory is not the same as required memory.Required memory is dened as the minimum amount of memory a soware program needs to successfully complete a job.Required memory is obviously less than or equal to peak memory.However, the required memory could be signicantly smaller than the peak memory, because a soware program may use more memory when it is available but not necessarily require this optional memory to successfully complete a job.
In Table 5, the listed time and memory for the exibility parameters calculation used average equilibrium values and included bond-bond cross terms.Since the exibility parameters calculation (using our SAVESTEPS Python code) ran on a single computing core, its total core hours was simply the elapsed wall time for that job.For these jobs, we computed the required memory as follows.In the batch script that was submitted to the job scheduler for the Expanse cluster at SDSC, we requested that a specic amount of memory be set aside for running the batch job.We submitted multiple such batch jobs that were identical except they requested different amounts of memory.Jobs that requested too little memory failed due to an out-of-memory error.If a job failed due to an out-of-memory error, we submitted a new job that requested more memory.If a job completed successfully, we submitted a new job that requested less memory.In this way, we found the minimum amount of memory (i.e., the required memory) that had to be requested in order for the job to complete successfully.
As expected, Table 5 shows an average trend of increasing computational time and memory as the number of atoms in the MOF's unit cell increased.However, there are some uctuations about this average trend in which a specic MOF may require more computational time or memory than a slightly larger MOF.As expected, the quantum chemistry calculations required orders of magnitude more core hours than the exibility parameters calculation.The required memory for the exibility parameters calculation was never higher than the peak memory for the quantum chemistry calculations.In other words, the exibility parameters calculations were less resource intensive than the quantum chemistry calculations.

How transferable are the force constant values?
To investigate the question of how transferable the optimized force constant values are between different structures, we compared results for a pair of MOFs having different crystal structure phases but the same reduced chemical formulas.Fig. 31 shows the crystal structures of the rst pair (CEGDUO   Table 6 summarizes the numbers of total and matched types for each of these two pairs.A stretch, bend, or dihedral type was considered 'matched' if it was comprised of the same atoms in the same order in both crystal structure phases.The number of 'matched types (within 3%)' satised the additional criterion that the equilibrium value of the corresponding internal coordinate differed by #3% between the MOF pair.For this comparison, the average equilibrium values (i.e., averaged over all instances of the same type within a particular MOF) were compared, and for dihedrals the absolute values of the dihedral instances for each type were averaged and compared (this conforms to exactly the same convention as used for all 'average equilibrium value' results presented in this article).This type matching does not have to be one-to-one.For example, symmetry breaking can produce the situation in which one type in the rst crystal structure matches two different types in the second crystal structure.
A stretch, bend, or dihedral type was considered 'unmatched' if it appeared in only one MOF of the pair but there was no corresponding type comprised of the same atoms in the same order in the other MOF.This situation could arise if the Fig. 33 Parity plots of stretch (top panels), bend (middle panels), and torsion (bottom panels) force constants between pairs of MOFs having the same reduced chemical formulas but different crystal structure phases.Data is only shown for the matched types that had average equilibrium values differing by #3% between the two MOFs.The left panels show results for CEGDUO/CEGFAW.The right panels show results for EBOBUV/ QIVYUR.For both pairs, the after-pruning results are plotted as blue squares.For EBOBUV/QIVYUR, the before-pruning results are plotted as solid black circles.
bond connectivity of atoms differed between the two crystal structures and/or different dihedrals were kept during dihedral pruning.Obviously, there is no notion of 'transferability' for types that are 'unmatched'.
Fig. 33 shows parity plots of stretch, bend, and torsion force constants for the matched types that had average equilibrium values differing by #3% between the two MOFs.From these results, the following conclusions can be drawn.First, the stretch force constant values were highly transferable and almost unchanged by dihedral pruning.Second, the bend force constant values were moderately transferable and almost unchanged by dihedral pruning.Before dihedral pruning, there was good but not great transferability of the torsion force constant values for matched types of the EBOBUV/QIVYUR pair.The torsion force constant values were highly impacted by dihedral pruning.Aer dihedral pruning, the torsion force constant values had poor transferability.

Molecular dynamics simulations to compute heat capacity and thermal expansion coefficient
To calculate the heat capacity (C p ) and volumetric thermal expansion coefficient (a) of IRMOF-1 and MIL-53(Ga), we performed molecular dynamics (MD) simulations using the RASPA soware package. 119Fig. 34 illustrates the crystal structures of these two MOFs.The MIL-53(Ga) system (refcode COMDOY) was modeled with a 2 × 3 × 3 supercell with periodic boundary conditions, while the IRMOF-1 system (refcode MIBQAR01) was modeled with a 2 × 2 × 2 supercell with periodic boundary conditions.The MD simulations used a 0.5 femtosecond time step, the Nose-Hoover thermostat with default settings, 120 and the barostat (with default settings) available in RASPA v2.The simulations were conducted in the NPT ensemble under a range of external temperatures (200, 300, and 400 K) and 1 atm pressure.We performed 100 000 equilibration cycles followed by 500 000 production cycles for MIL-53(Ga).We performed 50 000 equilibration cycles followed by 250 000 production cycles for IRMOF-1.Three different runs were performed at each condition and the results averaged.
We performed these simulations using two different force-elds.Forceeld # 1: we programmed Manz's new anglebending and dihedral-torsion model potentials into RASPA version 2. We used this modied RASPA version with our exibility models for the simulations.This forceeld did not include any Lennard-Jones parameters or atomic charges.Table 7 summarizes the types and instances of stretches, angles, and dihedrals in our exibility models for these two MOFs.Force-eld # 2: additionally, for IRMOF-1 we also used the exible forceeld developed by Dubbeldam et al. (DWES). 11The DWES forceeld included Lennard-Jones interactions and atomic charge interactions.
Table 8 summarizes the computed heat capacities for these materials.For IRMOF-1, both our exibility model and the DWES forceeld gave C p values in excellent agreement with the experimentally-measured value.For MIL-53(Ga), no experimentally-measured C p value was available.According to our calculations, the C p value for MIL-53(Ga) is predicted to be similar to but slightly higher than the C p value for IRMOF-1.
Table 9 compares different values for the volumetric thermal expansion coefficient a of IRMOF-1.For this material, the negative value of a is caused by ligand opping which increases with temperature and shortens the ligand end-to-end distance and lattice vector length. 11,122When using our exibility model for this material with the thermostat/barostat (with default Table 7 Summary of types and instances of stretches, angles, and dihedrals in our flexibility models for MIL-53(Ga) and IRMOF-1.The letters after the dot represent a different interaction type involving the same elements.The numbers in parentheses are the number of instances of that particular type

Stretches
Angles Dihedrals settings) in RASPA v2, the volumetric thermal expansion coef-cient a of IRMOF-1 was substantially over-estimated in magnitude compared to the experimentally-measured value.This is probably due to one of two possible reasons related to excessive oppiness of the ligands.It is possible (although not yet proved) that excessive oppiness of the ligands was caused by the omission of out-of-plane-distance (improper-dihedral) terms in our exibility model for this material.This issue will need to be studied in more detail in future work.Alternatively, it is possible (although not yet proved) that excessive oppiness of the ligands was caused by excessively large pressure and/or temperature uctuations introduced by the particular thermostat/barostat employed in these MD simulations.The choice of thermostat/barostat impacts the size of temperature/ pressure uctuations during MD simulations. 123At this time, different kinds of thermostats/barostats were not available to us for testing within the simulation code we used; consequently, this issue will need to be studied in more detail in future work.

Conclusions
In this work, we developed a new protocol (see Fig. 3) for tting the exibility parameters of a classical forceeld to quantummechanically-computed reference data.Our protocol uses the following functional form to describe bonded interactions: U flexibility = U bonds + U UB + U angles + U dihedrals + (U optional ) (90)   U bonds includes bond stretches between bonded neighbors.U UB includes Urey-Bradley interactions between a selected subset of second neighbors.In this work, we included Urey-Bradley interactions between diagonal corners of four-membered rings but not between other second neighbors.U angles includes angle bends for all bond angles except those contained in 3membered and 4-membered rings (our protocol discards angles in 3-membered rings, because their degrees of freedom are already covered by the bond stretches.Our protocol discards angles in 4-membered rings, because their degrees of freedom are already covered by the bond stretches and diagonal Urey-Bradley terms.).U dihedrals includes the aer-pruning dihedrals.If desired, bond-bond cross terms and/or other optional terms (U optional ) can be included in our protocol.
Some key benets of our SAVESTEPS protocol include the following: (1) It uses Manz's 50 ansatz for separating intracluster bonded interactions from intracluster nonbonded interactions.This separation ansatz allows the bonded interactions to be optimized up to and including second-order derivatives in the energy (i.e., harmonic approximation) without requiring any prior parameterization of the intracluster nonbonded interactions.
(2) When using Manz's separation ansatz, the 'resting value' of bond length, angle, or dihedral in each exibility term does not require special tting, because it equals the corresponding equilibrium value in the quantum-mechanically-computed optimized ground-state geometry. 50This allows the forceeld's bonded parameters to be optimized using linear regression instead of requiring nonlinear regression.
(3) The protocol is automated to facilitate its deployment across many materials.
(4) Using an automated procedure, symmetry-equivalent and near-symmetry-equivalent bonds, angles, or dihedrals are clas-sied together into the same type.All instances of the same type share the same force constant value.
(5) The selection of which internal coordinates and which exibility terms to include in the forceeld is performed in a way that preserves symmetry equivalency while reducing (but not eliminating) redundancy.Dihedral pruning is an important step in this selection process to reduce internal coordinate redundancy.
(6) Our protocol automatically classies dihedrals as: (a) non-rotatable if they are part of a ring, (b) hindered if they are not part of a ring but have limited range of motion, (c) rotatable if they are not part of a ring and have full range of motion, and (d) linear if they contain at least one linear equilibrium bond angle.
(7) Our protocol uses Manz's 50 potential energy models for angle bends, rotatable dihedral torsions, and linear-dihedral torsions.The potential energy for each rotatable dihedral type is modeled using a series expansion containing up to seven orthonormal modes, and only those modes making a signicant contribution are selected for inclusion in the forceeld.These angle-bending and ADDT potential models provide continuous energy derivatives (i.e., forces) even as the bond angle approaches linearity.(8) Our protocol optimizes force constant values using a training dataset.This optimization is performed using a regularized linear least squares tting based on the LASSO method with bounds on some force constants.This resolves the multicollinearity problem and also zeros out unnecessary force constants.
(9) Our protocol ensures that every independent degree of freedom of atom-in-material motion is sampled in the training dataset by including both nite-displacement (aka 'Hessian') geometries and AIMD geometries in the force training dataset.This was done while holding the unit cell's size and shape xed at the experimental values (as pointed out in Section 6.2, it is also possible to apply our protocol to reference geometries that use quantum-mechanically-computed lattice vectors instead of experimentally-measured lattice vectors).
(10) Our protocol ensures each rotatable dihedral type is adequately sampled by performing a series of quantum chemistry calculations across the full range of this dihedral's values.These rotatable dihedral energy scans are included in the training dataset.
(11) Our protocol includes a validation step that veries the optimized exibility parameter values accurately reproduce atom-in-material forces across brand new geometries (generated via AIMD) that were not used in the training set.Key statistical parameters including R-squared and RMSE are computed for the validation dataset.R-Squared and RMSE values are also computed and reported for individual atoms in the material to help identify if and where the forceeld needs to be improved.(12) When the equilibrium values are set individually for each instance of a type, each exibility term we used is dened such that U term = 0 and vU term /v(IC) = 0 at the optimized ground-state reference geometry, where IC is a corresponding internal coordinate of that exibility term.For this optimized ground-state geometry, all atom-in-material forces are identically zero for both the quantum-chemistry level of theory used in the training dataset and also for the classical forceeld produced by our optimization protocol.Moreover, U exibility = 0 at this optimized ground-state geometry.
Using this protocol, we constructed and optimized exibility parameters for 116 MOFs.For each MOF, this method's accuracy was assessed by computing the R-squared and RMSE values for a set of 991 geometries in each validation set: 990 new AIMDgenerated geometries that were not used in the training set, plus the optimized geometry.Even without cross terms, the exibility model yielded R-squared values of 0.910 (avg across all MOFs) ± 0.018 (st.dev.) for atom-in-material forces in the validation datasets.This is excellent performance.When bondbond cross terms were included, the exibility model yielded Rsquared values of 0.928 (avg across all MOFs) ± 0.015 (st.dev.) for atom-in-material forces in the validation datasets.
Finally, we note some choices in the types of exibility terms included in our protocol.In this work, we used Urey-Bradley 72 stretches only for the diagonals of 4-membered rings.It is possible to incorporate additional Urey-Bradley terms in our protocol to augment or replace some of the angle-bending interactions.In this work, we compared exibility models with and without bond-bond cross-terms.As evident from the statistics listed in the prior paragraph, including bond-bond cross terms produced only a small overall improvement in accuracy.Other types of cross terms (e.g., bond-angle, angleangle, etc.) could be explored. 5,79,130Such cross terms could be included in our protocol on an as-needed basis to further improve accuracy.In this work, our protocol used a harmonic bond stretch potential.If desired, anharmonic bond stretching terms could be included to improve accuracy. 6,8,80,131,132Our general philosophy is that improper-dihedrals and out-of-planedistances are not required to construct an accurate exible forceeld, because these degrees of freedom are already covered by linear combinations of bonds, angles, and proper dihedrals already used in the force eld.Though not required, cross terms, 5,79,130 anharmonic terms, 6,8,80,131,132 improper-dihedrals, out-of-plane distances, and other renements could be included in our protocol.Such tweaks to the exibility terms could further improve accuracy at the expense of slightly increased computational cost and complexity.
We believe this protocol should nd widespread applications for developing classical non-reactive exible forceelds for nanoporous solids, small molecules, and other materials.The automated nature of this protocol facilitates deployment across large numbers of materials.The protocol is concise and computationally efficient without gratuitous oversimplication.
In Section 9, we investigated the question of force constant transferability for similar internal coordinate types appearing in two different chemical structures.For matched types with equilibrium values within 3%, the stretch and bend force constant values exhibited good transferability between different chemical structures.For matched types with equilibrium values within 3%, the torsion force constants exhibited medium transferability before dihedral pruning but poor transferability aer dihedral pruning.
In Section 10, we presented molecular dynamics calculations of the heat capacity and volumetric thermal expansion coefficient for IRMOF-1 and MIL-53(Ga).This demonstrates utility of our framework exibility models for computing some bulk thermodynamic properties of MOFs.We recommend that future work explore the calculation of bulk thermodynamic and mechanical properties in more detail.We recommend that future work compare results using different thermostats, barostats, and ensembles to better understand the effects of computational settings on the computed bulk property values.Specically, future work should try to resolve the question of whether the overestimation of volumetric thermal expansion coefficient magnitude for IRMOF-1 was due to an inaccuracy of our exibility model for this material (e.g., neglect of out-ofplane/improper torsion terms in our exibility model for this material) or due to excessive uctuations introduced by the particular thermostat/barostat that was used in the molecular dynamics simulations.
We also recommend that future work explore the computation of bulk modulus and elastic constants for MOFs using our exibility models.This will require a detailed analysis of

Paper
2][113] Bulk modulus values are sometimes theoretically estimated by tting an equation of state to simulated energy versus volume curves at absolute zero temperature neglecting the zero-point vibrational energy. 15,111owever, due to the ligand oppiness that increases with temperature, that approach may not be accurate for estimating the bulk modulus of IRMOF-1 (and other MOFs with oppy ligands) near ambient temperatures.On the other hand, computing the bulk modulus using MD simulations in the NPT ensemble introduces challenges because the magnitude of volume uctuations is strongly impacted by the choice of barostat. 112,1335][136] This issue is beyond the scope of the present work, and we recommend that it be explored in future studies.

Fig. 1
Fig. 1 Types of bonded interactions studied in this work.

Fig. 5
Fig.5Flowchart for generating lists of stretch types and instances, angle types and instances, and dihedral types and instances.

Fig. 6
Fig. 6 Example format for a stretch instance (a), stretch type (b), and ball-and-stick illustration (c).

Fig. 7
Fig. 7 Panels (a) and (b): illustration of bonds comprised of the same order of atom types but having dramatically different equilibrium bond lengths.Shown here are the PBE (ref.99) + D3 (ref.100)/aug-cc-pvtz optimized geometries (which we computed using Gaussian software 101 ) of Li 3 and Na 3 clusters that exhibit Jahn-Teller distortion.Panels (c) and (d): illustration of angles comprised of the same order of atom types, and comprised of the same order of bond types, but having dramatically different equilibrium angle values.This proves that defining unique angle types cannot be based solely on the underlying atom types and bond types but also must consider the angle's equilibrium value.Shown here is a ball and stick model of sulfur hexafluoride (SF 6 ).Selected angles are highlighted using navy as the color of the first atom, white as the color of the center atom, and cyan as the color of the last atom.

Fig. 8
Fig. 8 Example format for an angle instance (a), angle type (b), and ball-and-stick illustration (c).

Fig. 10
Fig. 10 Illustration of proper dihedrals involving atoms A, B, C, and D. Panels (i) and (ii) show examples of dihedrals that were used in our flexibility model.Panel (i) shows a dihedral in which neither contained equilibrium bond angle is linear.Panel (ii) shows a dihedral in which one of the contained equilibrium bond angles is linear.Panels (iii) to (v)show examples of dihedrals that were not used in our flexibility model.The 3-member ring in panel (iii) is already described by the bond lengths, so no corresponding dihedral term in the flexibility model is required.The 4-member ring in panel (iv) is already described by its six stretches: AB, BC, CD, AD, AC, and BD.In panel (v), both the ABCD and the EBCD dihedrals were excluded, because the BCD angle is inside a 4-membered ring.

Fig. 11
Fig. 11 Illustration showing why the smallest bond path cycle passing through a particular bond cannot contain more than four translated images of the same atom.Please see the text for a detailed description.

Fig. 12
Fig.12Illustration of coupled dihedral types for ball and stick model of ethane (C 2 H 6 ).Here, selected dihedrals are highlighted using blue as the color of the first atom and cyan as the color of the last atom.We can define two dihedral types for ethane with different absolute values of dihedral angles: 180°(left panel) and 60°(right panels).These dihedrals have the same C-C as their middle bond.To construct a concise forcefield, we can use either dihedral type and discard the other.

7. 3
Dening SSE, SST, R-squared, and RMSE Optimizing the exibility model (i.e., force constant values) to the combined training dataset yields the matrix b containing optimized force constants values.As explained in the following section, some of the force constants may have values equal to zero (aka 'eliminated').

Fig. 14 A
Fig. 14 A stacked histogram showing the number of MOFs with different unit cell sizes as quantified by the number of atoms per unit cell.The total number of MOFs was 116, and we optimized flexibility parameters for all these.

Fig. 16
Fig. 16 Stacked histograms showing the number of MOFs containing various numbers of active internal coordinate types (left panels) and active internal coordinate instances (right panels).The top panels are for bond and UB stretches.The middle panels are for angle bends.The bottom panels are for dihedral torsions after dihedral pruning.

Fig. 17
Fig. 17 Histograms showing the distribution of the number of rotatable dihedral types per MOF (left panel) and the distribution of the number of rotatable dihedral instances per MOF (right panel).These results are after dihedral pruning.

Fig. 20
Fig. 20 Parity plot showing the number of dihedral instances after pruning compared to before pruning in quadrant 1 MOFs.

Fig. 19
Fig. 19 Left panel: Histogram showing the number of smart-selected torsion modes in each rotatable dihedral type.Right panel: Histogram showing which rotatable dihedral modes were smart selected.

Fig. 22
Fig. 22 Histogram of difference between R-squared before dihedral pruning and R-squared after dihedral pruning for MOFs in quadrant 1.

Fig. 28
contains raincloud plots showing the distribution of R-squared and RMSE values for rotatable dihedrals training, forces training, and validation datasets for MOFs in quadrants 3 and 4 without bond-bond cross terms.All of these R-squared values were greater than 0.85, and all of the median R-squared values were well above 0.90.The median R-squared value was ∼0.99 for the rotatable dihedrals training dataset, and this attests to an outstanding protocol.R-Squared values for the forces training and validation datasets were similar, and this demonstrates the exibility models did not have overtting.The RMSE values for rotatable dihedrals training, forces training, and validation datasets were reasonable and did not have many outliers.

Fig. 24
Fig. 24 Parity plot showing computational time for flexibility parameters calculation when the flexibility model was constructed with dihedral pruning compared to without dihedral pruning.These computational times include optimizing force constant values, computing statistics for the training datasets, and computing statistics for the validation datasets.These computational times do not include the times for quantum chemistry calculations to prepare the training and validation datasets.

Fig. 25
Fig. 25 Histogram of difference between R-squared for l / 0 and R-squared for l = l best for rotatable dihedrals training dataset (left panel) and forces training dataset (right panel) in MOFs belonging to quadrants 3 and 4.

Fig. 26
Fig.26Histograms of force constants eliminated by the bounds or regularization constraints in the LASSO method applied to MOFs in quadrants 3 and 4.These histograms show results after dihedral pruning using average (top panels) and individual (bottom panels) equilibrium values with bond-bond cross terms (left panels) and without bond-bond cross terms (right panels).

Fig. 27
Fig. 27 Histogram of difference between R-squared with bond-bond cross terms and R-squared without bond-bond cross terms for the validation dataset in MOFs from quadrants 3 and 4.

Fig. 28
Fig. 28 Raincloud plots of R-squared (left panels) and RMSE (right panels) for rotatable dihedrals training (top panels), forces training (central panels), and validation (bottom panels) for MOFs in quadrants 3 and 4 without cross terms and after pruning.The red distributions represent the values for individual equilibrium values, while the blue distributions represent the values for average equilibrium values.
, [200,299], [300,399], [400,499].The quantum chemistry computational costs included: (a) optimizing the MOF's geometry (atom-in-material positions) while holding the lattice vectors constant at their experimental values, (b) AIMD calculations for training dataset, (c) nitedisplacement 'Hessian' geometries for training dataset, (d) single-point energy calculations for a rigid torsion scan for one instance of each rotatable dihedral type (if any were present in the MOF), (e) AIMD calculations for validation dataset.For each MOF, the total core hours for quantum chemistry calculations was computed as follows total_core_hours ¼

Fig. 29
Fig. 29 Raincloud plots showing the distribution of atom-wise R-squared and atom-wise RMSE (eV Å −1 ) values for atom-in-material forces in the validation datasets for OGIBUD (top panels) and HEBZEV (bottom panels).Results are plotted for individual (red) and average (blue) equilibrium values.

Fig. 30
Fig. 30 Plots of computational time for flexibility parameters calculation versus number of atoms in the MOF's unit cell.These computational times include optimizing force constant values, computing statistics for the training datasets, and computing statistics for the validation datasets.These computational times do not include the times for quantum chemistry calculations to prepare the training and validation datasets.The top panels are for 79 MOFs in quadrants 1 and 2. The bottom panels are for 37 MOFs in quadrants 3 and 4. The left (right) panels are for computations with (without) bond-bond cross terms.

Fig. 32
Fig.32The pair of MOFs EBOBUV and QIVYUR having different crystal structure phases but the same reduced chemical formulas.

Fig. 31
Fig.31The pair of MOFs CEGDUO and CEGFAW having different crystal structure phases but the same reduced chemical formulas.

Fig. 34
Fig. 34 The crystal structures of MIL-53(Ga) (refcode COMDOY) and IRMOF-1 (refcode MIBQAR01). 8,[74][75][76][77][78][79] N training_parts is the number of separate parts in the training dataset for which R-squared values are computed.Specically, N training_parts = 1 if no rotatable dihedrals are present, and N training_parts = 2 if any rotatable dihedrals are present.Note that R combined_training 2 is the average of R-squared values for the training parts.Examining eqn (79) and (80), this choice for u i maximizes the sum of R-squared values for the forces training and rotatable dihedrals training datasets subject to the applied constraints.

Table 3
Summary of training and validation statistics for MOFs in quadrants 1 and 3.The fourth column indicates whether bond-bond cross (bbc) terms were included.Each numeric entry is the average ± standard deviation

Table 4
Summary of performance statistics for OGIBUD and HEBZEV.The results displayed outside (inside) parentheses represent outcomes from models optimized with (without) bond-bond cross terms

Table 5
Total computational time and memory for: (i) quantum chemistry calculations (using VASP) to compute the training and validation datasets and (ii) flexibility parameters calculation using our SAVESTEPS Python code.Please see the main text for how the peak memory and required memory were defined

Table 6
Number of total and matched types for pairs of MOFs before pruning (BP) and after pruning (AP) of dihedrals.For 'matched types (within 3%)', the equilibrium value of the corresponding internal coordinate differed by #3% between the MOF pair.For 'matched types of different value', the equilibrium value of the corresponding internal coordinate differed by >3% between the MOF pair

Table 8
Comparison of heat capacities at 1 atm and 300 K of different MOFs.BP = before dihedral pruning; AP = after dihedral pruning MOF/forceeld used C p (J g −1 K −1 )

Table 9
Comparison of volumetric thermal expansion coefficient a for IRMOF-1 in the range 200-400 K. BP = before dihedral pruning; AP = after dihedral pruning